prunaai/gemma-4-26b-a4b-fast

This is a version of the Gemma 4 26B mixture-of-experts (MoE) model, optimised by Pruna AI.


Run time and cost

This model runs on Nvidia H100 GPU hardware. We don't yet have enough runs of this model to provide performance information.


Gemma 4 26B-A4B

Multimodal reasoning model for text, images, and video.

This Replicate endpoint serves an optimized version of Gemma 4 26B-A4B, a 26B-parameter vision-language MoE (Mixture of Experts) model from Google DeepMind designed for instruction following, reasoning, coding, document understanding, and agent-style workflows.

Compared with the original Hugging Face model card, this page focuses on the hosted experience: fast access to a production-ready version of the model without the self-hosting setup.

What it does

Gemma 4 26B-A4B is a general-purpose multimodal model that can:

  • Answer questions about text, images, and video
  • Reason over diagrams, charts, and visual documents
  • Follow complex instructions with native system prompt support
  • Perform coding and agent-style tasks with built-in function calling
  • Handle long-context workloads up to 256K tokens
  • Work across 140+ languages
  • Think step-by-step before answering (configurable thinking mode)

Why use this model

Gemma 4 26B-A4B combines strong language reasoning with native multimodal understanding in an efficient MoE architecture: only ~4B of its 26B total parameters are active per token, giving near-4B-model speed with quality approaching that of a much larger dense model. It is well suited for:

  • Visual question answering
  • Document parsing, OCR, and data extraction
  • Coding and technical assistance
  • Multilingual assistants
  • Long-context analysis (up to 256K tokens)
  • Agentic applications with tool use and function calling
  • Reasoning-heavy product features

Highlights

  • Efficient MoE architecture: 128 experts, with 8 routed experts plus 1 shared expert active per token — runs almost as fast as a 4B model while leveraging 26B total parameters
  • Built-in thinking mode: configurable step-by-step reasoning before answering
  • Native function calling: designed for agentic and tool-calling workflows
  • Variable image resolution: configurable visual token budget (70 to 1120 tokens per image) for balancing detail vs speed
  • 256K context window: native support for very long inputs
  • Native system prompt support: structured and controllable conversations via the system role
  • Broad language coverage: pre-trained on 140+ languages with strong multilingual performance
  • Apache 2.0 license: permissive open-source license
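As a sketch of how these knobs might be combined in a request, the snippet below builds an input payload for use with the Replicate Python client. The field names (`thinking`, `visual_tokens`, `image`) are illustrative assumptions, not the confirmed schema — check the endpoint's API tab on Replicate for the actual input names.

```python
# Hedged sketch: builds an input payload for this endpoint.
# Field names are illustrative assumptions, not the confirmed schema.

def build_input(prompt, image_url=None, thinking=False, visual_tokens=560):
    # Clamp the per-image visual token budget to the documented 70-1120 range.
    visual_tokens = max(70, min(1120, visual_tokens))
    payload = {
        "prompt": prompt,
        "thinking": thinking,          # enable step-by-step reasoning
        "visual_tokens": visual_tokens,
    }
    if image_url is not None:
        # Images should precede text in the prompt; this endpoint
        # handles that ordering automatically.
        payload["image"] = image_url
    return payload

# Usage (requires the `replicate` package and an API token):
# import replicate
# out = replicate.run("prunaai/gemma-4-26b-a4b-fast",
#                     input=build_input("Describe this chart.", image_url=url))
```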

Model details

Property            Value
Model               google/gemma-4-26B-A4B-it
Architecture        Causal language model with vision encoder (MoE)
Total parameters    25.2B
Active parameters   ~3.8B per token
Experts             128 total, 8 active + 1 shared
Layers              30
Context length      262,144 tokens
Modality support    Text, images, video
License             Apache 2.0
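The figures in the table imply the efficiency ratio quoted earlier. A quick back-of-envelope check, using only numbers from the table above:

```python
# Back-of-envelope check of the MoE efficiency figures from the table.
total_params = 25.2e9      # total parameters
active_params = 3.8e9      # active parameters per token (approximate)
experts_total = 128
experts_active = 8 + 1     # 8 routed + 1 shared expert per token

# Fraction of all parameters touched per token.
active_fraction = active_params / total_params      # ~15%
# Fraction of experts consulted per token.
expert_fraction = experts_active / experts_total    # ~7%

print(f"{active_fraction:.1%} of parameters active per token")
print(f"{expert_fraction:.1%} of experts active per token")
```

The gap between the two ratios reflects the dense components (attention, embeddings, vision encoder) that every token passes through regardless of expert routing.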

Performance overview

Gemma 4 26B-A4B delivers strong results across a wide range of benchmarks:

Benchmark              Gemma 4 26B-A4B   Gemma 3 27B
MMLU Pro               82.6%             67.6%
AIME 2026 (no tools)   88.3%             20.8%
LiveCodeBench v6       77.1%             29.1%
GPQA Diamond           82.3%             42.4%
MMMLU                  86.3%             70.7%
MMMU Pro (vision)      73.8%             49.7%
MATH-Vision            82.4%             46.0%

Best use cases

Use this model when you need a single endpoint that can handle:

  • Chat with image or video input
  • Screenshot or UI understanding
  • OCR and document Q&A
  • Diagram and chart comprehension
  • Multilingual assistants
  • Reasoning-heavy product features (with thinking mode)
  • Agent pipelines that mix perception and action
  • Function calling and tool use
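For the function-calling and tool-use cases, tools are typically declared as JSON schemas passed alongside the conversation. A minimal, hedged sketch — the schema style below is a common convention for tool-calling models, not this endpoint's confirmed wire format:

```python
import json

# Hypothetical tool declaration; the schema style is a common convention
# for tool-calling models, not this endpoint's confirmed format.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A model trained for function calling emits a structured call such as:
example_call = {"name": "get_weather", "arguments": {"city": "Zurich"}}

# The application executes the named function with those arguments and
# feeds the result back to the model as a follow-up message.
print(json.dumps(example_call))
```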

Notes

  • When thinking mode is enabled, the model reasons internally before responding. This generally improves answer quality at the cost of additional output tokens.
  • For best results with multimodal inputs, the model expects images and video before text in the prompt. This endpoint handles that ordering automatically.
  • Recommended sampling parameters from Google: temperature=1.0, top_p=0.95, top_k=64.
  • Because this is a hosted and optimized Replicate deployment, behavior and latency may differ from raw self-hosted Hugging Face checkpoints.
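The recommended sampling settings above can be bundled once and reused across calls. A minimal sketch for use with the Replicate Python client — the values follow Google's recommendation, but whether the hosted endpoint exposes all three knobs under these exact names is an assumption to verify against its schema:

```python
# Google's recommended sampling settings for Gemma 4 26B-A4B.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

def make_request(prompt, **overrides):
    # Merge the recommended defaults with per-call overrides.
    # Whether the endpoint accepts all three knobs is an assumption.
    return {"prompt": prompt, **SAMPLING, **overrides}

# Usage:
# import replicate
# replicate.run("prunaai/gemma-4-26b-a4b-fast", input=make_request("Hello"))
```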

Limitations

Like other large multimodal models, Gemma 4 26B-A4B can still:

  • Hallucinate facts or visual details
  • Make mistakes on fine-grained counting or localization
  • Underperform on highly domain-specific inputs without careful prompting
  • Produce variable outputs across languages and long contexts
  • Struggle with subtle nuance, sarcasm, or figurative language

Human review is recommended for high-stakes use cases.
