openai/gpt-4o

OpenAI's high-intelligence chat model

Official · 170.9K runs · Commercial use

Input

prompt (string)
The prompt to send to the model. Do not use if using messages.

system_prompt (string)
System prompt to set the assistant's behavior.

messages (string)
Only available via the API. A JSON string representing a list of messages, for example: [{"role": "user", "content": "Hello, how are you?"}]. If provided, prompt and system_prompt are ignored.

Default: []
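For example, a multi-turn conversation can be passed through messages as a serialized JSON string. A minimal sketch using the replicate Python client (assuming the client is installed and REPLICATE_API_TOKEN is set in the environment):

```python
import json
import replicate

# Chat history in role/content format, serialized to a JSON string
# as the messages input expects.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]

output = replicate.run(
    "openai/gpt-4o",
    input={"messages": json.dumps(history)},  # prompt and system_prompt are ignored
)
print("".join(output))  # output arrives as an iterable of text chunks
```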

image_input (file[])
List of images to send to the model.

Default: []

temperature (number, minimum: 0, maximum: 2)
Sampling temperature between 0 and 2; higher values make the output more random, lower values more deterministic.

Default: 1

max_completion_tokens (integer)
Maximum number of completion tokens to generate.

Default: 4096

top_p (number, minimum: 0, maximum: 1)
Nucleus sampling parameter: the model considers only the tokens comprising the top_p probability mass, so 0.1 means only the tokens in the top 10% probability mass are considered.

Default: 1

frequency_penalty (number, minimum: -2, maximum: 2)
Frequency penalty: positive values penalize tokens in proportion to how often they have already appeared, reducing verbatim repetition.

Default: 0

presence_penalty (number, minimum: -2, maximum: 2)
Presence penalty: positive values penalize tokens that have already appeared in the text so far, increasing the model's likelihood to talk about new topics.

Default: 0
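Putting the schema together, a typical prompt-style call sets a prompt plus a few sampling controls. Another sketch with the replicate Python client; the image URL is a hypothetical placeholder, and image_input can be omitted for text-only calls:

```python
import replicate

output = replicate.run(
    "openai/gpt-4o",
    input={
        "prompt": "Describe what is happening in this image.",
        "system_prompt": "You are a concise visual analyst.",
        "image_input": ["https://example.com/photo.jpg"],  # hypothetical URL
        "temperature": 0.7,            # 0-2; lower = more deterministic
        "max_completion_tokens": 256,  # cap on generated tokens
    },
)
print("".join(output))
```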

Output

The 16th president of the United States was Abraham Lincoln. He served from March 4, 1861, until his assassination on April 15, 1865.
Input tokens: 29
Output tokens: 36
Tokens per second: 59.70
Pricing

Model pricing for openai/gpt-4o. Looking for volume pricing? Get in touch.

$2.50
per million input tokens

or 400,000 tokens for $1

$10.00
per million output tokens

or 100,000 tokens for $1
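As a quick sanity check at these rates, the example completion above (29 input tokens, 36 output tokens) costs well under a tenth of a cent:

```python
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

# 29 input tokens and 36 output tokens, as in the example output above
cost = 29 * INPUT_RATE + 36 * OUTPUT_RATE
print(f"${cost:.5f}")  # ≈ $0.00043
```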

Official models are always on, maintained, and have predictable pricing.

Check out our docs for more information about how pricing works on Replicate.

Readme

GPT‑4o is OpenAI’s most advanced flagship model, offering natively multimodal capabilities across text, vision, and audio. It delivers GPT-4‑level performance with faster response times and lower cost, making it ideal for real-time, high-volume applications. GPT‑4o supports audio inputs and outputs, handles images and text simultaneously, and is designed to feel conversational and responsive — like interacting with a human assistant in real time.


Key Capabilities

  • Multimodal input & output: Supports text, images, audio (input) and audio/text (output)
  • Real-time audio responsiveness: Latency as low as 232 ms
  • 128K token context window for deep reasoning over long content
  • High performance across reasoning, math, and code tasks
  • Unified model for all modalities—no need to switch between specialized models

Benchmark Highlights

MMLU (Language understanding):       87.2%
HumanEval (Python coding):           90.2%
GSM8K (Math word problems):          94.4%
MMMU (Vision QA):                    74.1%
VoxCeleb (Speaker ID):               95%+ (est.)
Audio latency (end-to-end):          ~232–320 ms

Use Cases

  • Real-time voice assistants and spoken dialogue agents
  • Multimodal document Q&A (PDFs with diagrams, charts, or images)
  • Code writing, explanation, and debugging
  • High-volume summarization and extraction from audio/text/image
  • Tutoring, presentations, and interactive education tools

Developer Notes

  • Available via OpenAI API and ChatGPT (Free, Plus, Team, Enterprise)
  • In ChatGPT, GPT‑4o is now the default GPT-4-level model
  • Audio input/output is supported only in ChatGPT for now
  • Image and text input supported via both API and ChatGPT
  • Supports streaming, function calling, tool use, and vision APIs (see the streaming sketch below)
  • Context window of 128K tokens via the API; ChatGPT limits vary by plan
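A minimal streaming sketch with the replicate Python client; replicate.stream yields server-sent events as tokens are generated:

```python
import replicate

# Print tokens as they arrive instead of waiting for the full completion.
for event in replicate.stream(
    "openai/gpt-4o",
    input={"prompt": "Explain nucleus sampling in one paragraph."},
):
    print(str(event), end="", flush=True)
print()
```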