lucataco / apollo-3b

Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models

  • Public
  • 118 runs
  • GitHub
  • Weights
  • Paper
  • License
Iterate in playground

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Apollo is a family of Large Multimodal Models (LMMs) that push the state-of-the-art in video understanding. It supports tasks including: - Long-form video comprehension - Temporal reasoning - Complex video question-answering - Multi-turn conversations grounded in video content

Apollo models excel at handling hour-long videos, balancing speed and accuracy through strategic design decisions. Our models outperform most 7B competitors at just 3B parameters and even rival 30B-scale models.

Key Highlights: - Scaling Consistency: Design decisions validated on smaller models and datasets effectively transfer to larger scales, reducing computation and experimentation costs. - Efficient Video Sampling: fps sampling and advanced token resampling strategies (e.g., Perceiver) yield stronger temporal perception. - Encoder Synergies: Combining SigLIP-SO400M (image) with InternVideo2 (video) delivers a robust representation, outperforming single encoders on temporal tasks. - ApolloBench: A streamlined evaluation benchmark (41x faster) that focuses on true video understanding capabilities.

Citation

If you find this project useful, please consider citing:

@article{zohar2024apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}

For more details, visit the project website or check out the paper.