Claude Opus 4.6
Claude Opus 4.6 is Anthropic’s flagship language model built for complex, multi-step work. The model excels at coding, financial analysis, legal reasoning, and long-running tasks that need sustained focus.
What you can do with it
Claude Opus 4.6 is designed for tasks that need deep reasoning and the ability to work autonomously over longer periods:
Coding and software engineering
The model plans carefully, operates reliably in large codebases, and catches its own mistakes through improved code review and debugging. It achieved the highest score on Terminal-Bench 2.0 (65.4%), an evaluation that tests how well models complete real-world coding tasks in a terminal environment.
Financial analysis and research
Claude Opus 4.6 can combine regulatory filings, market reports, and internal data to produce analyses that would take human analysts days to complete. It achieves state-of-the-art performance on Finance Agent (60.7%), a benchmark that tests core financial analyst tasks like researching SEC filings.
Legal reasoning and document work
With a 90.2% score on BigLaw Bench, the model handles complex legal reasoning across multiple documents. It can read through large sets of legal or technical documents and extract specific information rather than just providing summaries.
Finding information in large contexts
The model is substantially better at retrieving relevant information from very large amounts of text. On the 8-needle, 1M-token variant of MRCR v2, a benchmark that tests a model’s ability to find information hidden in large contexts, Claude Opus 4.6 scores 76%, compared to just 18.5% for Claude Sonnet 4.5.
How it works
Claude Opus 4.6 uses extended thinking to work through complex problems more carefully. The model can decide when deeper reasoning would help and when to move quickly through straightforward parts of a task.
The model features a one-million-token context window (in beta), letting it process and reason across much more information than previous versions. It can also output up to 128,000 tokens, enough to complete substantial documents or code without splitting them into multiple pieces.
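As a rough illustration of what these limits mean in practice, the sketch below estimates whether a batch of documents fits in the context window while leaving room for a full-length output. The two limits come from the text above. Everything else is an assumption: the 4-characters-per-token ratio is a common rule of thumb rather than a property of the model's tokenizer, the function and constant names are hypothetical, and the code conservatively treats output tokens as sharing the context window.

```python
# Limits taken from the text above; everything else is an illustrative assumption.
CONTEXT_WINDOW_TOKENS = 1_000_000  # one-million-token context window (beta)
MAX_OUTPUT_TOKENS = 128_000        # maximum output length
CHARS_PER_TOKEN = 4                # rough heuristic for English text, NOT the real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` with a characters-per-token heuristic."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(documents: list[str], reserved_output: int = MAX_OUTPUT_TOKENS) -> bool:
    """Return True if the documents plus a reserved output budget fit in the window.

    Assumes (conservatively) that output tokens count against the context window.
    """
    prompt_tokens = sum(estimate_tokens(doc) for doc in documents)
    return prompt_tokens + reserved_output <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Eight placeholder documents of roughly 100,000 estimated tokens each.
    filings = ["x" * 400_000] * 8
    print(fits_in_context(filings))
```

For real workloads, an actual token count from the provider is preferable to a character-based estimate, since token density varies considerably between prose, code, and tabular data.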
Key improvements
Better planning and focus
The model plans more carefully at the start of tasks, sustains work over longer periods, and revisits its reasoning before settling on answers. This produces better results on harder problems, though it can add cost and latency on simpler ones.
Stronger knowledge retrieval
Claude Opus 4.6 holds and tracks information over hundreds of thousands of tokens with less drift than its predecessors. It picks up buried details that even Claude Opus 4.5 would miss.
Higher quality outputs
Documents, spreadsheets, and presentations come out closer to production-ready quality on the first pass. The model needs fewer revisions to reach the level of polish needed for professional work.
Performance
Claude Opus 4.6 leads across several industry benchmarks:
- Terminal-Bench 2.0: 65.4% (highest score for agentic coding)
- OSWorld: 72.7% (agentic computer use)
- Finance Agent: 60.7% (financial analysis tasks)
- BigLaw Bench: 90.2% (legal reasoning)
- ARC-AGI-2: 68.8% (abstract problem solving)
- Humanity’s Last Exam: leads all frontier models (multidisciplinary reasoning)
On GDPval-AA, which measures performance on economically valuable knowledge work in finance, legal, and other domains, Claude Opus 4.6 outperforms the industry’s next-best model by around 144 Elo points.
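To give a sense of scale for an Elo gap of this size, the standard Elo formula converts a rating difference into an expected head-to-head score. The sketch below assumes GDPval-AA ratings follow the conventional 400-point logistic scale, which the text above does not state. The function name is illustrative.

```python
def expected_score(elo_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    `scale=400` is the conventional Elo constant and is an assumption here,
    not something documented for GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


if __name__ == "__main__":
    # A 144-point gap, as reported for GDPval-AA.
    print(f"{expected_score(144):.3f}")  # prints 0.696
```

Under that assumption, a 144-point lead corresponds to the higher-rated model being preferred in roughly 70% of pairwise comparisons.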
When to use it
Use Claude Opus 4.6 when you need:
- High-quality code generation and debugging in large codebases
- Deep analysis across multiple data sources
- Documents or presentations that need minimal revision
- Information retrieval from large sets of documents
- Tasks that require sustained focus over multiple steps
- Complex reasoning across different domains
The model works well for tasks where quality matters more than speed, and where the work would otherwise take significant human time and expertise.
Notes
Claude Opus 4.6 often thinks more deeply and revisits its reasoning before settling on an answer. That helps on harder problems but can add cost and latency on simpler ones. If the model is overthinking a task, you can adjust the effort parameter to balance quality, speed, and cost.
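One way to apply the effort adjustment described above is to choose an effort level per task type before building the request. This is only a sketch: the text confirms that an effort parameter exists, but the field names (`prompt`, `effort`), the level names (`low`, `medium`, `high`), and the task categories used here are all assumptions, so check the model's input schema for the actual values.

```python
# Hypothetical mapping from task type to effort level.
# The level names and task categories are assumptions for illustration only.
EFFORT_BY_TASK = {
    "lookup": "low",        # simple factual questions: favor speed and cost
    "summarize": "medium",  # moderate reasoning
    "refactor": "high",     # multi-step work in a large codebase: favor quality
}


def build_input(prompt: str, task_kind: str) -> dict:
    """Build a request payload with an effort level chosen by task type.

    Unknown task kinds fall back to "high", favoring quality over speed.
    The "prompt" and "effort" keys are assumed names, not a documented schema.
    """
    effort = EFFORT_BY_TASK.get(task_kind, "high")
    return {"prompt": prompt, "effort": effort}


if __name__ == "__main__":
    print(build_input("What year was the filing submitted?", "lookup"))
```

Defaulting unknown tasks to the highest level reflects the use case described here, where quality matters more than speed. A cost-sensitive deployment might choose a lower default instead.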
The model maintains a strong safety profile with low rates of misaligned behavior across safety evaluations. It matches or exceeds other frontier models in terms of safety and alignment.
Try Claude Opus 4.6 on the Replicate Playground.