Gemma 4 26B-A4B
Multimodal reasoning model for text, images, and video.
This Replicate endpoint serves an optimized version of Gemma 4 26B-A4B, a 26B-parameter vision-language MoE (Mixture of Experts) model from Google DeepMind designed for instruction following, reasoning, coding, document understanding, and agent-style workflows.
Compared with the original Hugging Face model card, this page focuses on the hosted experience: fast access to a production-ready version of the model without the self-hosting setup.
What it does
Gemma 4 26B-A4B is a general-purpose multimodal model that can:
- Answer questions about text, images, and video
- Reason over diagrams, charts, and visual documents
- Follow complex instructions with native system prompt support
- Perform coding and agent-style tasks with built-in function calling
- Handle long-context workloads up to 256K tokens
- Work across 140+ languages
- Think step-by-step before answering (configurable thinking mode)
Why use this model
Gemma 4 26B-A4B combines strong language reasoning with native multimodal understanding in an efficient MoE architecture — only 4B of its 26B total parameters are active per token, giving near-4B-model speed with the quality of a much larger model. It is well suited for:
- Visual question answering
- Document parsing, OCR, and data extraction
- Coding and technical assistance
- Multilingual assistants
- Long-context analysis (up to 256K tokens)
- Agentic applications with tool use and function calling
- Reasoning-heavy product features
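The active-vs-total parameter split described above comes from top-k expert routing: a gating network scores all experts for each token and only the highest-scoring ones run. A minimal illustrative sketch of that idea (not the model's actual gating code — expert count and top-k are taken from this card):

```python
import math
import random

def top_k_routing(gate_logits, k=8):
    """Return the indices of the k highest-scoring experts for one token."""
    # Softmax over the expert gate logits.
    probs = [math.exp(g) for g in gate_logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Keep only the top-k experts; the rest stay inactive for this token.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

NUM_EXPERTS, ACTIVE = 128, 8  # per the model details below
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
active = top_k_routing(logits, k=ACTIVE)
print(active)  # indices of the 8 experts this token is routed to
```

Only the selected experts' weights participate in the forward pass for that token, which is why per-token compute tracks the ~4B active parameters rather than the 26B total.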
Highlights
- Efficient MoE architecture: 128 experts with 8 active per token — runs almost as fast as a 4B model while leveraging 26B total parameters
- Built-in thinking mode: configurable step-by-step reasoning before answering
- Native function calling: designed for agentic and tool-calling workflows
- Variable image resolution: configurable visual token budget (70 to 1120 tokens per image) for balancing detail vs speed
- 256K context window: native support for very long inputs
- Native system prompt support: structured and controllable conversations via the `system` role
- Broad language coverage: pre-trained on 140+ languages with strong multilingual performance
- Apache 2.0 license: permissive open-source license
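For the function-calling highlight, a hedged sketch of what a tool definition might look like, using the common JSON-schema style — the exact format this endpoint accepts is an assumption, so check its API documentation:

```python
# Hypothetical tool definition in the widespread JSON-schema style.
# The field names ("name", "description", "parameters") are assumptions,
# not confirmed by this model card.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
        },
        "required": ["city"],
    },
}
```

A tool definition like this is passed alongside the prompt; the model then emits a structured call (tool name plus arguments) instead of free text when it decides the tool is needed.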
Model details
| Property | Value |
|---|---|
| Model | google/gemma-4-26B-A4B-it |
| Architecture | Causal language model with vision encoder (MoE) |
| Total parameters | 25.2B |
| Active parameters | ~3.8B per token |
| Experts | 128 total, 8 active + 1 shared |
| Layers | 30 |
| Context length | 262,144 tokens |
| Modality support | Text, images, video |
| License | Apache 2.0 |
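A minimal sketch of an input payload for this endpoint. The field names (`prompt`, `image`, `thinking`) are assumptions — consult this endpoint's API tab for the exact schema; the sampling values are Google's recommendations from the Notes below:

```python
# Hypothetical input payload; field names are assumptions, not the
# confirmed schema for this endpoint.
model_input = {
    "prompt": "Summarize the key trend in this chart.",
    "image": "https://example.com/chart.png",  # hypothetical image URL
    "thinking": True,        # assumed toggle for the configurable thinking mode
    "temperature": 1.0,      # recommended sampling parameters from Google
    "top_p": 0.95,
    "top_k": 64,
}

# To run against the hosted model (requires REPLICATE_API_TOKEN):
#   import replicate
#   output = replicate.run("google/gemma-4-26b-a4b-it", input=model_input)
```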
Performance overview
Gemma 4 26B-A4B delivers strong results across a wide range of benchmarks:
| Benchmark | Gemma 4 26B-A4B | Gemma 3 27B |
|---|---|---|
| MMLU Pro | 82.6% | 67.6% |
| AIME 2026 (no tools) | 88.3% | 20.8% |
| LiveCodeBench v6 | 77.1% | 29.1% |
| GPQA Diamond | 82.3% | 42.4% |
| MMMLU | 86.3% | 70.7% |
| MMMU Pro (vision) | 73.8% | 49.7% |
| MATH-Vision | 82.4% | 46.0% |
Best use cases
Use this model when you need a single endpoint that can handle:
- Chat with image or video input
- Screenshot or UI understanding
- OCR and document Q&A
- Diagram and chart comprehension
- Multilingual assistants
- Reasoning-heavy product features (with thinking mode)
- Agent pipelines that mix perception and action
- Function calling and tool use
Notes
- When thinking mode is enabled, the model reasons internally before responding. This generally improves answer quality at the cost of additional output tokens.
- For best results with multimodal inputs, the model expects images and video before text in the prompt. This endpoint handles that ordering automatically.
- Recommended sampling parameters from Google: `temperature=1.0`, `top_p=0.95`, `top_k=64`.
- Because this is a hosted and optimized Replicate deployment, behavior and latency may differ from raw self-hosted Hugging Face checkpoints.
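This endpoint orders media before text for you, but if you build prompts manually (e.g., against a self-hosted checkpoint), the media-before-text ordering looks like the sketch below. The message layout is the common chat-content style and is an assumption, not this endpoint's wire format:

```python
# Hypothetical chat message illustrating media-before-text ordering.
message = {
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/diagram.png"},  # media first
        {"type": "text", "text": "What does this diagram show?"},     # text after
    ],
}

print([part["type"] for part in message["content"]])
```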
Limitations
Like other large multimodal models, Gemma 4 26B-A4B can still:
- Hallucinate facts or visual details
- Make mistakes on fine-grained counting or localization
- Underperform on highly domain-specific inputs without careful prompting
- Produce variable outputs across languages and long contexts
- Struggle with subtle nuance, sarcasm, or figurative language
Human review is recommended for high-stakes use cases.