prunaai/gemma-4-26b-a4b-fast

This is a version of the Gemma 4 26B mixture-of-experts (MoE) model, optimised by Pruna AI.


Run time and cost

This model runs on Nvidia H100 GPU hardware. We don't yet have enough runs of this model to provide performance information.


Gemma 4 26B-A4B

Multimodal reasoning model for text, images, and video.

This Replicate endpoint serves an optimized version of Gemma 4 26B-A4B, a 26B-parameter vision-language MoE (Mixture of Experts) model from Google DeepMind designed for instruction following, reasoning, coding, document understanding, and agent-style workflows.

Compared with the original Hugging Face model card, this page focuses on the hosted experience: fast access to a production-ready version of the model without the self-hosting setup.

What it does

Gemma 4 26B-A4B is a general-purpose multimodal model that can:

  • Answer questions about text, images, and video
  • Reason over diagrams, charts, and visual documents
  • Follow complex instructions with native system prompt support
  • Perform coding and agent-style tasks with built-in function calling
  • Handle long-context workloads up to 256K tokens
  • Work across 140+ languages
  • Think step-by-step before answering (configurable thinking mode)

Why use this model

Gemma 4 26B-A4B combines strong language reasoning with native multimodal understanding in an efficient MoE architecture: only ~4B of its 26B total parameters are active per token, giving near-4B-model speed with quality approaching that of a much larger dense model. It is well suited for:

  • Visual question answering
  • Document parsing, OCR, and data extraction
  • Coding and technical assistance
  • Multilingual assistants
  • Long-context analysis (up to 256K tokens)
  • Agentic applications with tool use and function calling
  • Reasoning-heavy product features

Highlights

  • Efficient MoE architecture: 128 experts, with 8 routed experts plus 1 shared expert active per token — runs almost as fast as a 4B model while leveraging 26B total parameters
  • Built-in thinking mode: configurable step-by-step reasoning before answering
  • Native function calling: designed for agentic and tool-calling workflows
  • Variable image resolution: configurable visual token budget (70 to 1120 tokens per image) for balancing detail vs speed
  • 256K context window: native support for very long inputs
  • Native system prompt support: structured and controllable conversations via the system role
  • Broad language coverage: pre-trained on 140+ languages with strong multilingual performance
  • Apache 2.0 license: permissive open-source license
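As a sketch of how these knobs might be combined in a request, the snippet below builds an input payload for use with the Replicate Python client. The field names (`thinking`, `visual_tokens`, `image`) are illustrative assumptions, not the confirmed schema — check the endpoint's API tab on Replicate for the actual input names.

```python
# Hedged sketch: builds an input payload for this endpoint.
# Field names are illustrative assumptions, not the confirmed schema.

def build_input(prompt, image_url=None, thinking=False, visual_tokens=560):
    # Clamp the per-image visual token budget to the documented 70-1120 range.
    visual_tokens = max(70, min(1120, visual_tokens))
    payload = {
        "prompt": prompt,
        "thinking": thinking,          # enable step-by-step reasoning
        "visual_tokens": visual_tokens,
    }
    if image_url is not None:
        # Images should precede text in the prompt; this endpoint
        # handles that ordering automatically.
        payload["image"] = image_url
    return payload

# Usage (requires the `replicate` package and an API token):
# import replicate
# out = replicate.run("prunaai/gemma-4-26b-a4b-fast",
#                     input=build_input("Describe this chart.", image_url=url))
```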

Model details

Property            Value
Model               google/gemma-4-26B-A4B-it
Architecture        Causal language model with vision encoder (MoE)
Total parameters    25.2B
Active parameters   ~3.8B per token
Experts             128 total, 8 active + 1 shared
Layers              30
Context length      262,144 tokens
Modality support    Text, images, video
License             Apache 2.0
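The figures in the table imply the efficiency ratio quoted earlier. A quick back-of-envelope check, using only numbers from the table above:

```python
# Back-of-envelope check of the MoE efficiency figures from the table.
total_params = 25.2e9      # total parameters
active_params = 3.8e9      # active parameters per token (approximate)
experts_total = 128
experts_active = 8 + 1     # 8 routed + 1 shared expert per token

# Fraction of all parameters touched per token.
active_fraction = active_params / total_params      # ~15%
# Fraction of experts consulted per token.
expert_fraction = experts_active / experts_total    # ~7%

print(f"{active_fraction:.1%} of parameters active per token")
print(f"{expert_fraction:.1%} of experts active per token")
```

The gap between the two ratios reflects the dense components (attention, embeddings, vision encoder) that every token passes through regardless of expert routing.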

Performance overview

Gemma 4 26B-A4B delivers strong results across a wide range of benchmarks:

Benchmark              Gemma 4 26B-A4B   Gemma 3 27B
MMLU Pro               82.6%             67.6%
AIME 2026 (no tools)   88.3%             20.8%
LiveCodeBench v6       77.1%             29.1%
GPQA Diamond           82.3%             42.4%
MMMLU                  86.3%             70.7%
MMMU Pro (vision)      73.8%             49.7%
MATH-Vision            82.4%             46.0%

Best use cases

Use this model when you need a single endpoint that can handle:

  • Chat with image or video input
  • Screenshot or UI understanding
  • OCR and document Q&A
  • Diagram and chart comprehension
  • Multilingual assistants
  • Reasoning-heavy product features (with thinking mode)
  • Agent pipelines that mix perception and action
  • Function calling and tool use
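For the function-calling and tool-use cases, tools are typically declared as JSON schemas passed alongside the conversation. A minimal, hedged sketch — the schema style below is a common convention for tool-calling models, not this endpoint's confirmed wire format:

```python
import json

# Hypothetical tool declaration; the schema style is a common convention
# for tool-calling models, not this endpoint's confirmed format.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A model trained for function calling emits a structured call such as:
example_call = {"name": "get_weather", "arguments": {"city": "Zurich"}}

# The application executes the named function with those arguments and
# feeds the result back to the model as a follow-up message.
print(json.dumps(example_call))
```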

Notes

  • When thinking mode is enabled, the model reasons internally before responding. This generally improves answer quality at the cost of additional output tokens.
  • For best results with multimodal inputs, the model expects images and video before text in the prompt. This endpoint handles that ordering automatically.
  • Recommended sampling parameters from Google: temperature=1.0, top_p=0.95, top_k=64.
  • Because this is a hosted and optimized Replicate deployment, behavior and latency may differ from raw self-hosted Hugging Face checkpoints.
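The recommended sampling settings above can be bundled once and reused across calls. A minimal sketch for use with the Replicate Python client — the values follow Google's recommendation, but whether the hosted endpoint exposes all three knobs under these exact names is an assumption to verify against its schema:

```python
# Google's recommended sampling settings for Gemma 4 26B-A4B.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

def make_request(prompt, **overrides):
    # Merge the recommended defaults with per-call overrides.
    # Whether the endpoint accepts all three knobs is an assumption.
    return {"prompt": prompt, **SAMPLING, **overrides}

# Usage:
# import replicate
# replicate.run("prunaai/gemma-4-26b-a4b-fast", input=make_request("Hello"))
```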

Limitations

Like other large multimodal models, Gemma 4 26B-A4B can still:

  • Hallucinate facts or visual details
  • Make mistakes on fine-grained counting or localization
  • Underperform on highly domain-specific inputs without careful prompting
  • Produce variable outputs across languages and long contexts
  • Struggle with subtle nuance, sarcasm, or figurative language

Human review is recommended for high-stakes use cases.
