Official

openai / gpt-4.1

OpenAI's flagship GPT model for complex tasks.

  • Public
  • 3.5K runs
  • Priced per token
  • Commercial use
  • License

Input

  • Prompt (string): The prompt to send to the model. Do not use if using messages.
  • System prompt (string): Sets the assistant's behavior.
  • Images (file[], default: []): List of images to send to the model, e.g. https://replicate.delivery/pbxt/MvnA4wptE8FOHD44bKsfVj8hQdXSvdDAcFgYs5GEODou9OP9/4b2ebb2d-89d8-43de-bc84-51c380365a40.jpg
  • Temperature (number, 0 to 2, default: 1): Sampling temperature.
  • Max completion tokens (integer, default: 4096): Maximum number of completion tokens to generate.
  • Top-p (number, 0 to 1, default: 1): Nucleus sampling parameter; the model considers only the tokens making up the top_p probability mass (0.1 means only the tokens comprising the top 10% of probability mass are considered).
  • Frequency penalty (number, -2 to 2, default: 0): Positive values penalize tokens based on how often they have already appeared, discouraging repetition.
  • Presence penalty (number, -2 to 2, default: 0): Positive values penalize tokens that have appeared in the text so far, increasing the model's likelihood to talk about new topics.

Output

In the image, someone is spreading butter on a slice of toast using a product labeled "BUTTER STICK TYPE." The product resembles a glue stick, but it is meant for butter, allowing the user to easily apply butter to toast by rubbing the stick directly onto the bread. This is a creative and convenient way to spread butter, especially while the toast is still warm and the butter softens and melts quickly.
  • Input tokens: 780
  • Output tokens: 84
  • Tokens per second: 21.06
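The run metrics relate as tokens per second = output tokens / generation time, so the reported figures imply roughly how long decoding took (assuming the throughput figure excludes time to first token):

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate decoding time implied by a throughput figure."""
    return output_tokens / tokens_per_second


# 84 output tokens at 21.06 tokens/second, as reported above:
elapsed = generation_seconds(84, 21.06)  # about 3.99 seconds
```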

Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

Type      Per unit             Per $1
Input     $2.00 / 1M tokens    500K tokens / $1
Output    $8.00 / 1M tokens    125K tokens / $1

For example, for $10 you can run around 1,776 predictions where the input is a sentence or two (15 tokens) and the output is a few paragraphs (700 tokens).
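The per-token arithmetic above can be sketched directly from the rates in the table:

```python
# Rates from the pricing table: $2.00 per 1M input tokens,
# $8.00 per 1M output tokens.
INPUT_USD_PER_TOKEN = 2.00 / 1_000_000
OUTPUT_USD_PER_TOKEN = 8.00 / 1_000_000


def prediction_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single prediction in US dollars."""
    return (input_tokens * INPUT_USD_PER_TOKEN
            + output_tokens * OUTPUT_USD_PER_TOKEN)


# A short prompt (15 tokens) with a few paragraphs of output (700 tokens):
cost = prediction_cost_usd(15, 700)   # about $0.00563 per prediction
runs_per_10_usd = int(10 / cost)      # about 1,776 predictions for $10
```

Output tokens dominate the bill here: at these rates, the 700-token output costs nearly 200x more than the 15-token input.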

Check out our docs for more information about how per-token pricing works on Replicate.

Readme

GPT-4.1 is a high-performance language model optimized for real-world applications, delivering major improvements in coding, instruction following, and long-context comprehension. It supports up to 1 million tokens of context, features a June 2024 knowledge cutoff, and is designed to be more reliable and cost-effective across a wide range of use cases — from building intelligent agents to processing large codebases and documents. GPT‑4.1 offers improved reasoning, faster output, and significantly enhanced formatting fidelity.


Key Capabilities

  • 1M token context window for large document/code handling
  • Improved instruction following, including format adherence, content control, and negative/ordered instructions
  • Top-tier performance in coding tasks and diffs
  • Optimized for agentic workflows, long-context reasoning, and tool use
  • Real-world tested across legal, financial, engineering, and developer tools

Benchmark Highlights

  • SWE-bench Verified (coding): 54.6%
  • MultiChallenge (instruction following): 38.3%
  • IFEval (format compliance): 87.4%
  • Video-MME (long video QA): 72.0%
  • Aider diff-format accuracy: 53%
  • Graphwalks (multi-hop reasoning): 62%

Use Cases

  • Building agentic systems with strong multi-turn coherence
  • Editing and understanding large codebases or diff formats
  • Complex data extraction from lengthy documents
  • Highly structured content generation
  • Multimodal reasoning tasks (e.g., charts, diagrams, videos)

🔧 Developer Notes

  • Available via OpenAI API only
  • Supports up to 32,768 output tokens
  • Compatible with prompt caching and Batch API
  • Designed for production-scale performance and reliability

🧪 Real-World Results

  • Windsurf: 60% higher accuracy on internal code benchmarks; smoother tool usage
  • Qodo: Better suggestions in 55% of pull request reviews, with higher precision and focus
  • Blue J: 53% more accurate on complex tax scenarios
  • Thomson Reuters: 17% improvement in long-document legal review
  • Carlyle: 50% better retrieval accuracy across large financial files