# Apollo 7B
Apollo is a family of large multimodal models built for video understanding. Developed by Meta researchers, the 7-billion-parameter version handles long-form video comprehension, temporal reasoning, and multi-turn conversations grounded in video content.
## What it does
Apollo takes a video and a text prompt, then answers questions about the video content. It can handle hour-long videos, describe what’s happening at specific moments, reason about the order of events, and have back-and-forth conversations about what it sees.
### Video question-answering
Ask the model about anything in a video: what’s happening, who’s doing what, how things change over time. It handles complex temporal reasoning — understanding not just individual frames, but the flow of events across the entire video.
### Long-form video comprehension
Apollo processes videos from short clips to hour-long recordings. The model uses efficient sampling strategies and token resampling to handle long videos without losing important details.
## How it works
The model combines two specialized vision encoders — SigLIP for spatial understanding and InternVideo2 for temporal/motion understanding — feeding into a language model backbone based on Qwen2.5-7B. This dual-encoder approach lets Apollo understand both what things look like and how they move.
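The dual-encoder idea can be sketched as a simple feature fusion: each frame yields spatial features (SigLIP-style) and temporal features (InternVideo2-style), which are joined channel-wise and projected into the language model's embedding space. The dimensions and the plain linear projection below are illustrative assumptions, not Apollo's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16 sampled frames, encoder dims, and an LLM width
# in the ballpark of a Qwen2.5-7B hidden size. All are assumptions.
T, d_spatial, d_temporal, d_model = 16, 1152, 768, 3584

spatial = rng.standard_normal((T, d_spatial))    # "what things look like"
temporal = rng.standard_normal((T, d_temporal))  # "how they move"

# Channel-wise concatenation, then a linear projection into the LLM space.
fused = np.concatenate([spatial, temporal], axis=-1)   # (T, 1920)
proj = rng.standard_normal((d_spatial + d_temporal, d_model))
tokens = fused @ proj                                  # (T, 3584)

assert tokens.shape == (T, d_model)
```

In the real model these fused tokens are further compressed by the resampler before reaching the language backbone.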
Key design choices:
- fps-based video sampling rather than uniform frame sampling, which better captures the actual timing of events
- Perceiver-based token resampling to efficiently compress video information without losing temporal detail
- Scaling consistency: design decisions validated on smaller models transfer effectively to larger scales
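The difference between fps-based and uniform sampling can be shown with a small helper (hypothetical code, not Apollo's implementation): fps sampling keeps the time gap between sampled frames constant, so event timing is preserved across videos of different lengths, with a uniform-thinning fallback when a frame budget is exceeded.

```python
def sample_frame_indices(num_frames, native_fps, target_fps, max_frames=None):
    """Pick frame indices at a fixed temporal rate (fps-based sampling).

    Unlike uniform sampling (a fixed frame count regardless of duration),
    this keeps the interval between sampled frames constant in seconds.
    """
    step = native_fps / target_fps  # source frames between samples
    indices = [int(round(i * step)) for i in range(int(num_frames / step) + 1)]
    indices = [i for i in indices if i < num_frames]
    if max_frames is not None and len(indices) > max_frames:
        # Over budget: thin the fps-sampled indices uniformly.
        keep = [round(j * (len(indices) - 1) / (max_frames - 1))
                for j in range(max_frames)]
        indices = [indices[k] for k in keep]
    return indices

# A 10 s clip at 30 fps, sampled at 2 fps:
idx = sample_frame_indices(300, 30, 2)
# 20 indices spaced 15 frames apart: 0, 15, 30, ..., 285
```

A longer video sampled the same way simply yields more frames, which is why the frame budget and the token resampler matter for hour-long inputs.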
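The Perceiver-style resampling step can be illustrated with a single cross-attention pass in NumPy: a small set of latent queries attends over all video tokens and emits a fixed-length compressed sequence, no matter how many tokens the encoders produced. Sizes here are arbitrary, and a real resampler uses learned projections, multiple heads, and multiple layers; this is only a sketch of the mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64            # embedding dim (illustrative)
n_tokens = 1024   # video tokens coming out of the encoders
n_latents = 32    # fixed compressed length

video_tokens = rng.standard_normal((n_tokens, d))
latents = rng.standard_normal((n_latents, d))  # learned queries (random here)

# One cross-attention step: each latent attends over all video tokens.
attn = softmax(latents @ video_tokens.T / np.sqrt(d))  # (32, 1024)
compressed = attn @ video_tokens                       # (32, 64)

assert compressed.shape == (n_latents, d)
```

The output length is set by `n_latents`, not by the input, which is what keeps the token count handed to the language model bounded for long videos.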
## Model variants
The Apollo family comes in several sizes:
| Model | Base Model | Parameters |
|---|---|---|
| Apollo-1.5B | Qwen2.5-1.5B | 1.5B |
| Apollo-3B | Qwen2.5-3B | 3B |
| Apollo-7B (this model) | Qwen2.5-7B | 7B |
The 7B version outperforms most competing models at its size and rivals some 30B-scale models on video understanding benchmarks.
## Performance
Apollo-7B achieves strong results on standard video understanding benchmarks. The accompanying work also introduces ApolloBench, a streamlined evaluation suite that runs 41× faster than existing alternatives while focusing on genuine video understanding capabilities.
## Links
## License
Apache 2.0