lucataco/apollo-7b

Apollo 7B - An Exploration of Video Understanding in Large Multimodal Models

Public
124.8K runs

Run time and cost

This model costs approximately $0.013 to run on Replicate, or 76 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 14 seconds, though predict time varies significantly based on the inputs.

Readme

Apollo 7B

Apollo is a family of large multimodal models built for video understanding. Developed by Meta researchers, the 7 billion parameter version handles long-form video comprehension, temporal reasoning, and multi-turn conversations grounded in video content.

What it does

Apollo takes a video and a text prompt, then answers questions about the video content. It can handle hour-long videos, describe what’s happening at specific moments, reason about the order of events, and have back-and-forth conversations about what it sees.

Video question-answering

Ask the model about anything in a video: what’s happening, who’s doing what, how things change over time. It handles complex temporal reasoning — understanding not just individual frames, but the flow of events across the entire video.

Long-form video comprehension

Apollo processes videos from short clips to hour-long recordings. The model uses efficient sampling strategies and token resampling to handle long videos without losing important details.

How it works

The model combines two specialized vision encoders — SigLIP for spatial understanding and InternVideo2 for temporal/motion understanding — feeding into a language model backbone based on Qwen2.5-7B. This dual-encoder approach lets Apollo understand both what things look like and how they move.
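The dual-encoder idea can be sketched in a few lines of NumPy. This is an illustrative toy, not Apollo's actual code: the dimensions, the one-vector-per-frame simplification, and the random projection weights are all placeholders (real encoders emit many tokens per frame, and the projection is learned).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not Apollo's real configuration.
num_frames, d_spatial, d_temporal, d_lm = 8, 1152, 768, 3584

# Stand-ins for SigLIP (spatial) and InternVideo2 (temporal) outputs,
# reduced here to one feature vector per frame.
spatial = rng.standard_normal((num_frames, d_spatial))
temporal = rng.standard_normal((num_frames, d_temporal))

# Concatenate along the feature axis, then project into the language
# model's embedding space (random weights stand in for a learned layer).
fused = np.concatenate([spatial, temporal], axis=-1)          # (8, 1920)
w_proj = rng.standard_normal((d_spatial + d_temporal, d_lm)) * 0.02
video_tokens = fused @ w_proj                                 # (8, 3584)
```

The point of the fusion is that each frame's representation carries both appearance (spatial) and motion (temporal) information before the language model ever sees it.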

Key design choices:

  • fps-based video sampling rather than uniform frame sampling, which better captures the actual timing of events
  • Perceiver-based token resampling to efficiently compress video information without losing temporal detail
  • Scaling consistency: design decisions validated on smaller models transfer effectively to larger scales

Model variants

The Apollo family comes in several sizes:

Model                    Base Model      Parameters
Apollo-1.5B              Qwen2.5-1.5B    1.5B
Apollo-3B                Qwen2.5-3B      3B
Apollo-7B (this model)   Qwen2.5-7B      7B

The 7B version outperforms most competing models at its size and rivals some 30B-scale models on video understanding benchmarks.

Performance

Apollo-7B achieves strong results on standard video understanding benchmarks. It also introduces ApolloBench, a streamlined evaluation benchmark that runs 41× faster than existing alternatives while focusing on genuine video understanding capabilities.

License

Apache 2.0
