Apollo: An Exploration of Video Understanding in Large Multimodal Models
Apollo is a family of Large Multimodal Models (LMMs) that push the state of the art in video understanding. It supports tasks including:
- Long-form video comprehension
- Temporal reasoning
- Complex video question-answering
- Multi-turn conversations grounded in video content
Apollo models excel at handling hour-long videos, balancing speed and accuracy through strategic design decisions. At just 3B parameters, Apollo outperforms most 7B competitors, and the family even rivals 30B-scale models.
Key Highlights:
- Scaling Consistency: Design decisions validated on smaller models and datasets effectively transfer to larger scales, reducing computation and experimentation costs.
- Efficient Video Sampling: fps sampling and advanced token resampling strategies (e.g., Perceiver) yield stronger temporal perception.
- Encoder Synergies: Combining SigLIP-SO400M (image) with InternVideo2 (video) delivers a robust representation, outperforming single encoders on temporal tasks.
- ApolloBench: A streamlined evaluation benchmark (41x faster) that focuses on true video understanding capabilities.
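To make the "fps sampling" highlight concrete, here is a minimal sketch contrasting it with uniform sampling, the common alternative of taking a fixed number of frames per video. This is an illustration under stated assumptions, not Apollo's actual data pipeline: the function names `fps_sample_indices` and `uniform_sample_indices` and their parameters are hypothetical, and the README does not specify the exact sampling procedure.

```python
def fps_sample_indices(num_frames, native_fps, target_fps):
    """Pick frame indices at a fixed rate of `target_fps` frames per second.

    Hypothetical helper: the time gap between sampled frames stays constant,
    so longer videos simply yield more frames.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, native_fps and target_fps must be positive")
    # Number of samples that fit in the video's duration at the target rate.
    num_samples = max(1, int(num_frames * target_fps / native_fps))
    step = native_fps / target_fps  # source frames between consecutive samples
    return [min(num_frames - 1, int(i * step)) for i in range(num_samples)]


def uniform_sample_indices(num_frames, num_samples):
    """Pick a fixed number of evenly spaced frame indices.

    Hypothetical helper: the frame count is constant, so the time gap
    between sampled frames grows with video length.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [min(num_frames - 1, int(i * num_frames / num_samples))
            for i in range(num_samples)]


if __name__ == "__main__":
    # A 10-second clip and a 100-second clip, both recorded at 30 fps.
    for frames in (300, 3000):
        fps_idx = fps_sample_indices(frames, native_fps=30.0, target_fps=2.0)
        uni_idx = uniform_sample_indices(frames, num_samples=8)
        print(frames, len(fps_idx), len(uni_idx))
```

With fps sampling the model always sees frames a fixed time apart (here 0.5 s), whereas with uniform sampling the spacing depends on the length of the video, which is one plausible reason a fixed rate helps temporal perception.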
Citation
If you find this project useful, please consider citing:
@article{zohar2024apollo,
title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
journal={arXiv preprint arXiv:2412.10360},
year={2024}
}
For more details, visit the project website or check out the paper.