Apollo: An Exploration of Video Understanding in Large Multimodal Models

Apollo is a family of Large Multimodal Models (LMMs) that push the state of the art in video understanding. It supports tasks including:

- Long-form video comprehension
- Temporal reasoning
- Complex video question-answering
- Multi-turn conversations grounded in video content

Apollo models excel at handling hour-long videos, balancing speed and accuracy through strategic design decisions. Our models outperform most 7B competitors at just 3B parameters and even rival 30B-scale models.

Key Highlights:

- Scaling Consistency: Design decisions validated on smaller models and datasets effectively transfer to larger scales, reducing computation and experimentation costs.
- Efficient Video Sampling: fps sampling and advanced token resampling strategies (e.g., Perceiver) yield stronger temporal perception.
- Encoder Synergies: Combining SigLIP-SO400M (image) with InternVideo2 (video) delivers a robust representation, outperforming single encoders on temporal tasks.
- ApolloBench: A streamlined evaluation benchmark (41x faster) that focuses on true video understanding capabilities.
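
To make the fps sampling highlight concrete, here is a minimal sketch of how frame indices could be chosen at a fixed rate, next to a fixed-count (uniform) sampler for contrast. This is not Apollo's actual preprocessing code: the function names `fps_sample_indices` and `uniform_sample_indices`, their parameters, and the 2 fps rate in the usage lines are all hypothetical and chosen only for illustration.

```python
def fps_sample_indices(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at roughly `target_fps` frames per second of video.

    Hypothetical helper for illustration; not part of the Apollo codebase.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, native_fps and target_fps must be positive")
    # Number of sampled frames grows with the clip duration.
    num_samples = max(1, int(num_frames * target_fps / native_fps))
    # Distance, in source frames, between consecutive sampled frames.
    step = native_fps / target_fps
    return [min(num_frames - 1, int(i * step)) for i in range(num_samples)]


def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick a fixed number of evenly spaced frame indices, whatever the duration.

    Hypothetical helper for illustration; not part of the Apollo codebase.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [min(num_frames - 1, int(i * num_frames / num_samples)) for i in range(num_samples)]


# A 10-second and a 100-second clip, both recorded at 30 fps.
short_clip = fps_sample_indices(300, native_fps=30.0, target_fps=2.0)
long_clip = fps_sample_indices(3000, native_fps=30.0, target_fps=2.0)
fixed_short = uniform_sample_indices(300, num_samples=8)
fixed_long = uniform_sample_indices(3000, num_samples=8)
```

In this sketch the fps sampler keeps the time gap between sampled frames constant, so a longer clip simply yields more frames. The fixed-count sampler always returns 8 frames, so the time between them stretches as the clip gets longer.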

Citation

If you find this project useful, please consider citing:

@article{zohar2024apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}

For more details, visit the project website or check out the paper.
