prunaai/gemma-4-26b-a4b-fast

This is a version of the Gemma 4 26B-A4B MoE model, optimized by Pruna AI.


Gemma 4 26B-A4B

Multimodal reasoning model for text, images, and video.

This Replicate endpoint serves an optimized version of Gemma 4 26B-A4B, a 26B-parameter vision-language MoE (Mixture of Experts) model from Google DeepMind designed for instruction following, reasoning, coding, document understanding, and agent-style workflows.

Compared with the original Hugging Face model card, this page focuses on the hosted experience: fast access to a production-ready version of the model without the self-hosting setup.

What it does

Gemma 4 26B-A4B is a general-purpose multimodal model that can:

  • Answer questions about text, images, and video
  • Reason over diagrams, charts, and visual documents
  • Follow complex instructions with native system prompt support
  • Perform coding and agent-style tasks with built-in function calling
  • Handle long-context workloads up to 256K tokens
  • Work across 140+ languages
  • Think step-by-step before answering (configurable thinking mode)
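Because this is a hosted endpoint, the capabilities above are reachable through a single API call. Below is a minimal sketch using the Replicate Python client; the input field names ("prompt", "image") and the example image URL are assumptions made for illustration, so check this endpoint's input schema for the exact names.

```python
# Sketch of querying this endpoint with a text + image prompt.
# Field names ("prompt", "image") are assumptions; consult the
# endpoint's input schema on Replicate for the real ones.

payload = {
    "prompt": "Summarize the key trend in this chart.",
    "image": "https://example.com/chart.png",  # hypothetical image URL
}

# The actual call (requires the `replicate` package and an API token):
# import replicate
# output = replicate.run("prunaai/gemma-4-26b-a4b-fast", input=payload)
# print("".join(output))
```

The model identifier in the commented call is the one at the top of this page; video input, where supported, would be passed the same way as the image field.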

Why use this model

Gemma 4 26B-A4B combines strong language reasoning with native multimodal understanding in an efficient MoE architecture — only 4B parameters are active per token out of 26B total, giving near-4B-model speed with much larger model quality. It is well suited for:

  • Visual question answering
  • Document parsing, OCR, and data extraction
  • Coding and technical assistance
  • Multilingual assistants
  • Long-context analysis (up to 256K tokens)
  • Agentic applications with tool use and function calling
  • Reasoning-heavy product features

Highlights

  • Efficient MoE architecture: 128 experts with 8 active per token — runs almost as fast as a 4B model while leveraging 26B total parameters
  • Built-in thinking mode: configurable step-by-step reasoning before answering
  • Native function calling: designed for agentic and tool-calling workflows
  • Variable image resolution: configurable visual token budget (70 to 1120 tokens per image) for balancing detail vs speed
  • 256K context window: native support for very long inputs
  • Native system prompt support: structured and controllable conversations via the system role
  • Broad language coverage: pre-trained on 140+ languages with strong multilingual performance
  • Apache 2.0 license: permissive open-source license
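The trade-off behind the variable visual token budget can be made concrete with a little arithmetic using only the numbers on this page (a 262,144-token context and 70 to 1120 visual tokens per image); the 4,096-token text reserve below is an arbitrary illustrative choice, not a model parameter.

```python
# Rough capacity math for multi-image prompts, using figures from this
# page. The text-token reserve is an arbitrary illustrative choice.
CONTEXT_TOKENS = 262_144
TOKENS_PER_IMAGE_LOW, TOKENS_PER_IMAGE_HIGH = 70, 1120

def max_images(text_tokens: int, tokens_per_image: int) -> int:
    """How many images fit in the context alongside `text_tokens` of text."""
    return (CONTEXT_TOKENS - text_tokens) // tokens_per_image

high_detail = max_images(4_096, TOKENS_PER_IMAGE_HIGH)  # full-detail images
low_detail = max_images(4_096, TOKENS_PER_IMAGE_LOW)    # low-detail images
```

At the low-detail setting, roughly sixteen times as many images fit in the same context as at full detail, which is the speed/detail balance the token budget controls.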

Model details

Property           Value
Model              google/gemma-4-26B-A4B-it
Architecture       Causal language model with vision encoder (MoE)
Total parameters   25.2B
Active parameters  ~3.8B per token
Experts            128 total, 8 active + 1 shared
Layers             30
Context length     262,144 tokens
Modality support   Text, images, video
License            Apache 2.0
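The efficiency claim behind these numbers can be sanity-checked with back-of-envelope arithmetic. This is an illustration of the active-to-total parameter ratio only, not a latency model.

```python
# Back-of-envelope MoE efficiency check, using the figures from the
# table above.
TOTAL_PARAMS = 25.2e9
ACTIVE_PARAMS = 3.8e9
EXPERTS_TOTAL, EXPERTS_ACTIVE = 128, 8  # plus 1 always-on shared expert

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # fraction of weights used per token
expert_fraction = (EXPERTS_ACTIVE + 1) / (EXPERTS_TOTAL + 1)  # routed + shared
```

Roughly 15% of the weights are touched per token, a larger share than the expert count alone would suggest; a plausible explanation is that attention and embedding weights are dense and always active.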

Performance overview

Gemma 4 26B-A4B delivers strong results across a wide range of benchmarks:

Benchmark              Gemma 4 26B-A4B   Gemma 3 27B
MMLU Pro               82.6%             67.6%
AIME 2026 (no tools)   88.3%             20.8%
LiveCodeBench v6       77.1%             29.1%
GPQA Diamond           82.3%             42.4%
MMMLU                  86.3%             70.7%
MMMU Pro (vision)      73.8%             49.7%
MATH-Vision            82.4%             46.0%

Best use cases

Use this model when you need a single endpoint that can handle:

  • Chat with image or video input
  • Screenshot or UI understanding
  • OCR and document Q&A
  • Diagram and chart comprehension
  • Multilingual assistants
  • Reasoning-heavy product features (with thinking mode)
  • Agent pipelines that mix perception and action
  • Function calling and tool use

Notes

  • When thinking mode is enabled, the model reasons internally before responding. This generally improves answer quality at the cost of additional output tokens.
  • For best results with multimodal inputs, the model expects images and video before text in the prompt. This endpoint handles that ordering automatically.
  • Recommended sampling parameters from Google: temperature=1.0, top_p=0.95, top_k=64.
  • Because this is a hosted and optimized Replicate deployment, behavior and latency may differ from raw self-hosted Hugging Face checkpoints.
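The recommended sampling defaults from the notes above can be passed alongside the prompt. The field names here mirror common Replicate input conventions and are assumptions; verify them against this endpoint's input schema.

```python
# Google's recommended sampling defaults from the notes above, expressed
# as part of an input payload. Field names are assumptions.
sampling_defaults = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

payload = {"prompt": "Explain mixture-of-experts routing in two sentences."}
payload.update(sampling_defaults)
```

Keeping these defaults in one dict makes it easy to reuse them across calls while varying only the prompt and media inputs.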

Limitations

Like other large multimodal models, Gemma 4 26B-A4B can still:

  • Hallucinate facts or visual details
  • Make mistakes on fine-grained counting or localization
  • Underperform on highly domain-specific inputs without careful prompting
  • Produce variable outputs across languages and long contexts
  • Struggle with subtle nuance, sarcasm, or figurative language

Human review is recommended for high-stakes use cases.
