lucataco/apollo-3b

Apollo 3B - An Exploration of Video Understanding in Large Multimodal Models


Input

- Input video file (file, required)
- Question or prompt about the video (string). Default: "Describe this video in detail"
- Sampling temperature (number, minimum: 0.1, maximum: 2). Default: 0.4
- Maximum number of tokens to generate (integer, minimum: 32, maximum: 1024). Default: 256
- Top-p sampling probability (number, minimum: 0, maximum: 1). Default: 0.7
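
A minimal sketch of calling the model with the Replicate Python client. The input field names used here (video, prompt, temperature, max_new_tokens, top_p) are assumptions based on the parameter descriptions above, not the model's published schema, so check the model's API tab for the exact names.

import replicate

# Illustrative call; the input field names below are assumed, not confirmed by this page.
output = replicate.run(
    "lucataco/apollo-3b",
    input={
        "video": open("my_clip.mp4", "rb"),       # local video file to upload
        "prompt": "Describe this video in detail",
        "temperature": 0.4,                        # sampling temperature (0.1-2)
        "max_new_tokens": 256,                     # maximum tokens to generate (32-1024)
        "top_p": 0.7,                              # top-p sampling probability (0-1)
    },
)
print(output)  # the model returns a text description of the video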

Output

The video features a lone astronaut in a white spacesuit, equipped with a helmet and gloves, standing on the moon's surface. The backdrop is dominated by a large, detailed image of the moon, set against a starry space. The astronaut begins to run across the lunar terrain, leaving footprints behind. As he runs, the camera angle shifts to reveal more of the moon's rugged landscape. The astronaut continues his run until he reaches the edge of the frame, where he leaps into the vast expanse of space, floating away from the moon. Throughout the sequence, the moon remains a constant and prominent feature in the background, emphasizing the astronaut's journey into the cosmos. The video captures the astronaut's solitary trek across the moon's surface and his subsequent leap into the unknown, symbolizing humanity's boundless curiosity and spirit of exploration. The astronaut's actions are depicted with precision and grace, highlighting the beauty and isolation of space travel. The video concludes with the astronaut floating freely in space, surrounded by the endless void of space, underscoring the awe-inspiring scale and mystery of the universe. The astronaut's journey serves as a powerful metaphor for human ambition and the quest for knowledge, encapsulating the essence of space exploration. The

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Apollo is a family of Large Multimodal Models (LMMs) that push the state of the art in video understanding. It supports tasks including:

- Long-form video comprehension
- Temporal reasoning
- Complex video question-answering
- Multi-turn conversations grounded in video content

Apollo models excel at handling hour-long videos, balancing speed and accuracy through strategic design decisions. Our models outperform most 7B competitors at just 3B parameters and even rival 30B-scale models.

Key Highlights:

- Scaling Consistency: Design decisions validated on smaller models and datasets effectively transfer to larger scales, reducing computation and experimentation costs.
- Efficient Video Sampling: fps sampling and advanced token resampling strategies (e.g., Perceiver) yield stronger temporal perception (see the sketch below).
- Encoder Synergies: Combining SigLIP-SO400M (image) with InternVideo2 (video) delivers a robust representation, outperforming single encoders on temporal tasks.
- ApolloBench: A streamlined evaluation benchmark (41x faster) that focuses on true video understanding capabilities.
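
To make the "Efficient Video Sampling" point concrete, here is a small illustrative sketch of fixed-fps frame selection versus fixed-count uniform sampling. The function names, constants, and the 2 fps target are invented for illustration and are not taken from the Apollo code; they only show why sampling at a constant frame rate preserves temporal density on long videos.

# Illustrative sketch only -- names and constants are not from the Apollo codebase.

def fps_sample_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    # Keep a constant number of frames per second of video (e.g. 2 fps),
    # so temporal density stays the same whether the clip is 30 seconds or an hour.
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(total_frames / step))]

def uniform_sample_indices(total_frames: int, num_frames: int) -> list[int]:
    # Spread a fixed frame budget evenly over the whole video,
    # so long videos are sampled much more sparsely in time.
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

# A 10-minute clip at 30 fps (18,000 frames):
total_frames = 10 * 60 * 30
print(len(fps_sample_indices(total_frames, native_fps=30, target_fps=2)))  # 1200 frames kept
print(len(uniform_sample_indices(total_frames, num_frames=32)))            # 32 frames kept

The token resampling mentioned above then compresses the per-frame visual tokens before they reach the language model, which is what keeps hour-long inputs tractable.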

Citation

If you find this project useful, please consider citing:

@article{zohar2024apollo,
    title={Apollo: An Exploration of Video Understanding in Large Multimodal Models},
    author={Zohar, Orr and Wang, Xiaohan and Dubois, Yann and Mehta, Nikhil and Xiao, Tong and Hansen-Estruch, Philippe and Yu, Licheng and Wang, Xiaofang and Juefei-Xu, Felix and Zhang, Ning and Yeung-Levy, Serena and Xia, Xide},
    journal={arXiv preprint arXiv:2412.10360},
    year={2024}
}

For more details, visit the project website or check out the paper.