lucataco/apollo-7b

Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models

Apollo 7B

Apollo is a family of large multimodal models built for video understanding. Developed by researchers at Meta GenAI and Stanford University, the 7-billion-parameter version handles long-form video comprehension, temporal reasoning, and multi-turn conversations grounded in video content.

What it does

Apollo takes a video and a text prompt, then answers questions about the video content. It can handle hour-long videos, describe what’s happening at specific moments, reason about the order of events, and hold multi-turn conversations about what it sees.
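As a rough illustration of the video-plus-prompt interface described above, here is a minimal sketch using the Replicate Python client. The input field names ("video", "prompt") and the helper names build_input and ask_apollo are assumptions made for this example, not confirmed by this page; check the model's input schema before relying on them. Depending on the client version, the model reference may also need to be pinned as "owner/name:version" with a version ID copied from the model page.

```python
def build_input(video_url, prompt, **extra):
    """Assemble the input payload for one question about one video.

    The keys "video" and "prompt" are assumed field names; any further
    options (passed through **extra) are likewise hypothetical.
    """
    payload = {"video": video_url, "prompt": prompt}
    payload.update(extra)
    return payload


def ask_apollo(video_url, prompt, model_ref="lucataco/apollo-7b"):
    """Send one video question to the hosted model and return its output.

    Requires `pip install replicate` and the REPLICATE_API_TOKEN
    environment variable. The import is local so the payload helper
    above can be used without the client installed.
    """
    import replicate

    return replicate.run(model_ref, input=build_input(video_url, prompt))


# Example payload only; no network call is made here.
example = build_input(
    "https://example.com/clip.mp4",  # placeholder URL
    "What happens after the person opens the door?",
)
print(example)
```

Calling ask_apollo with a reachable video URL would submit the request; the shape of the returned value depends on how the model is packaged.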

Video question-answering

Ask the model about anything in a video: what’s happening, who’s doing what, how things change over time. It handles complex temporal reasoning, following the flow of events across the entire video rather than interpreting frames in isolation.

Long-form video comprehension

Apollo processes videos from short clips to hour-long recordings. The model uses efficient sampling strategies and token resampling to handle long videos without losing important details.
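To make the idea of token resampling concrete, the sketch below shows the core operation in plain Python: a small, fixed set of latent queries attends over a larger set of visual tokens, so the output length no longer depends on the input length. This is a simplified, single-head version that omits the learned projection matrices a real Perceiver-style resampler would have; the sizes (256 tokens in, 32 out, dimension 8) and the random vectors are illustrative assumptions, not Apollo's actual configuration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample_tokens(tokens, latents):
    """Single-head cross-attention from latent queries to input tokens.

    tokens:  list of N vectors of length d (e.g. visual tokens of a clip)
    latents: list of K vectors of length d (learned in a real model)
    Returns K vectors of length d, regardless of N.
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in latents:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Each output is an attention-weighted average of the tokens.
        out.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)])
    return out


random.seed(0)
d = 8  # illustrative feature dimension
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(256)]
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(32)]
compressed = resample_tokens(tokens, latents)
print(len(tokens), "tokens in,", len(compressed), "tokens out")
```

Because the number of latents is fixed, doubling the number of input tokens leaves the output size unchanged, which is what keeps the language model's input manageable for long videos.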

How it works

The model combines two specialized vision encoders, SigLIP for spatial understanding and InternVideo2 for temporal and motion understanding, whose outputs feed a language model backbone based on Qwen2.5-7B. This dual-encoder approach lets Apollo understand both what things look like and how they move.

Key design choices:

  • fps-based video sampling rather than uniform frame sampling, which better captures the actual timing of events
  • Perceiver-based token resampling to efficiently compress video information without losing temporal detail
  • Scaling consistency: design decisions validated on smaller models transfer effectively to larger scales
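The difference between the two sampling strategies in the list above can be sketched in a few lines of Python. The function names and the example numbers (a 30 fps source, 2 sampled frames per second, 8 uniformly sampled frames) are assumptions chosen for illustration, not Apollo's actual settings.

```python
def uniform_sample(total_frames, num_samples):
    """Pick a fixed number of frame indices spread evenly over the video."""
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the frame at the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_sample(total_frames, native_fps, target_fps):
    """Pick frame indices at a fixed rate of target_fps frames per second."""
    # Never step by less than one frame, even if target_fps > native_fps.
    step = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


def gap_seconds(indices, native_fps):
    """Time between the first two sampled frames, in seconds."""
    return (indices[1] - indices[0]) / native_fps


# Compare a 10-second and a 100-second video, both recorded at 30 fps.
for total in (300, 3000):
    u = uniform_sample(total, 8)
    f = fps_sample(total, 30, 2)
    print(total, "frames:",
          "uniform gap", round(gap_seconds(u, 30), 2), "s,",
          "fps gap", round(gap_seconds(f, 30), 2), "s,",
          len(f), "frames kept")
```

With uniform sampling the time between sampled frames grows with video length (roughly 1.3 s for the short clip versus 12.5 s for the long one here), so the same motion looks faster in a longer video. With fps-based sampling the gap stays at 0.5 s in both cases, and the number of sampled frames grows with the duration instead.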

Model variants

The Apollo family comes in several sizes:

Model                    Base model     Parameters
Apollo-1.5B              Qwen2.5-1.5B   1.5B
Apollo-3B                Qwen2.5-3B     3B
Apollo-7B (this model)   Qwen2.5-7B     7B

The 7B version outperforms most competing models at its size and rivals some 30B-scale models on video understanding benchmarks.

Performance

Apollo-7B achieves strong results on standard video understanding benchmarks. The Apollo work also introduces ApolloBench, a streamlined evaluation benchmark that runs 41× faster than existing alternatives while focusing on genuine video understanding capabilities.

License

Apache 2.0
