nvidia/nemotron-3-nano-30b-a3b

Nemotron-3-Nano-30B-A3B is a large language model (LLM) trained from scratch by NVIDIA.

Nemotron 3 Nano

Nemotron 3 Nano is a language model from Nvidia designed for reasoning tasks, coding, and building AI agents. It has 31.6 billion total parameters but activates only about 3.6 billion per token, which makes it much faster than similar-sized models while remaining accurate.

What makes it different

The model uses a hybrid architecture that combines three different techniques: Mamba-2 layers for handling long contexts efficiently, Transformer attention for detailed reasoning, and mixture-of-experts routing that activates only 6 out of 128 experts for each token. This design gives you the reasoning quality of a much larger model while keeping it fast and affordable to run.
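To make the routing idea concrete, here is a minimal sketch of top-k mixture-of-experts routing using the figures quoted above (128 experts, 6 active per token). The layer shapes and the simple per-token loop are illustrative assumptions, not Nemotron's actual implementation.

```python
import torch
import torch.nn.functional as F

num_experts, top_k, d_model, d_ff = 128, 6, 1024, 4096  # expert counts from the text; dims assumed

# One small feed-forward network per expert.
experts = torch.nn.ModuleList(
    torch.nn.Sequential(
        torch.nn.Linear(d_model, d_ff),
        torch.nn.GELU(),
        torch.nn.Linear(d_ff, d_model),
    )
    for _ in range(num_experts)
)
router = torch.nn.Linear(d_model, num_experts)

def moe_forward(x):
    """x: (tokens, d_model) -> (tokens, d_model), running only 6 of 128 experts per token."""
    logits = router(x)                                 # score every expert for every token
    weights, idx = torch.topk(logits, top_k, dim=-1)   # keep the 6 best experts per token
    weights = F.softmax(weights, dim=-1)               # normalize their mixing weights
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for w, e in zip(weights[t].tolist(), idx[t].tolist()):
            out[t] += w * experts[e](x[t])             # only the selected experts do any work
    return out

tokens = torch.randn(4, d_model)
print(moe_forward(tokens).shape)  # torch.Size([4, 1024])
```

Because the router picks a small, fixed number of experts, the compute per token stays close to that of a ~3.6B dense model even though the full parameter count is much larger.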

Nemotron 3 Nano can work with up to 1 million tokens of context. That’s enough to fit entire codebases, long documents, or extended conversations without having to split things up into chunks. The architecture handles these large contexts efficiently because it avoids the memory overhead that standard Transformer models have.
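A back-of-the-envelope sketch of why this matters: a pure-attention Transformer's key/value cache grows linearly with context length, while Mamba-2 layers keep a fixed-size state. The hyperparameters below (layer count, KV heads, head dimension, fp16) are illustrative assumptions, not Nemotron's actual configuration.

```python
context_len = 1_000_000                       # the 1M-token context mentioned above
attn_layers, kv_heads, head_dim = 48, 8, 128  # assumed full-attention configuration
bytes_per_elem = 2                            # fp16

kv_per_token = attn_layers * kv_heads * head_dim * 2 * bytes_per_elem  # K and V per layer
kv_cache_bytes = context_len * kv_per_token
print(f"Full-attention KV cache at 1M tokens: {kv_cache_bytes / 1e9:.0f} GB")  # ~197 GB

# A hybrid that keeps only a fraction of its layers as attention shrinks this cost
# roughly in proportion, and the Mamba-2 layers contribute a constant-size state.
```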

What it’s good at

This model excels at math, coding, and multi-step tasks where it needs to use tools or maintain state across many turns. Nvidia trained it using reinforcement learning across different environments including math problems, code generation, tool calling, and conversational tasks.

The model has two modes: thinking mode and direct mode. In thinking mode, it generates internal reasoning steps before giving you the final answer, which improves accuracy on harder problems. In direct mode, it skips the reasoning and gives you the answer immediately. You can control this through the chat template.
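Here is a hedged sketch of toggling the two modes through the chat template with the Hugging Face tokenizer. The repo id and the `enable_thinking` flag are assumptions following the pattern other reasoning models use; check Nemotron's own chat template for the exact control it expects.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/Nemotron-3-Nano-30B-A3B")  # assumed repo id
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Thinking mode: the model emits internal reasoning before the final answer.
thinking_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Direct mode: skip the reasoning steps and answer immediately.
direct_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```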

Training

Nvidia trained Nemotron 3 Nano on 25 trillion tokens, including web crawls, code, math, science papers, and multilingual content covering 20 languages and 43 programming languages. The training used a three-stage process: massive-scale pre-training, supervised fine-tuning on synthetic data, and reinforcement learning across multiple task environments.

The model’s training data has a cutoff date of June 25, 2025 for pre-training and November 28, 2025 for post-training. It supports English, Spanish, French, German, Japanese, and Italian.

Performance

On an H200 GPU with 8,000 input tokens and 16,000 output tokens, Nemotron 3 Nano delivers 3.3x higher throughput than Qwen3-30B and 2.2x higher than GPT-OSS-20B. It’s about 4x faster than the previous Nemotron Nano 2 model.

The model scores 52 on the Intelligence Index v3.0 from Artificial Analysis, placing it at the top among models of similar size. It performs particularly well on SWE-Bench for code generation, GPQA Diamond for reasoning, and RULER for long-context understanding.

License

This model is released under the NVIDIA Open Model License. You can find the full license terms in Nvidia’s documentation.

You can try the model on the Replicate Playground.
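You can also call it from code with Replicate's Python client. The input field names below ("prompt", "max_tokens") are assumptions; check the model's API schema on Replicate for the exact parameters it accepts.

```python
import replicate

output = replicate.run(
    "nvidia/nemotron-3-nano-30b-a3b",
    input={
        "prompt": "Write a Python function that checks whether a string is a palindrome.",
        "max_tokens": 512,  # assumed parameter name
    },
)
print("".join(output))  # language models on Replicate typically stream text chunks
```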
