# Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉
Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, understanding long contexts, and powering capable AI agents—while only activating 2.8 billion parameters at runtime.
There are two flavors of Kimi-VL:
| Variant | Best For | Recommended Temperature |
|---|---|---|
| Kimi-VL-A3B-Instruct | General perception, OCR, videos, agents | 0.2 |
| Kimi-VL-A3B-Thinking | Complex reasoning, math problems, puzzles | 0.6 |
- Original GitHub Repository: moonshotai/Kimi-VL
- Model Weights on Hugging Face: Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking
## Why Kimi-VL Is Useful
- Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
- Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
- Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
- Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
- Fully Open-Source: Licensed entirely under MIT, making it easy and flexible for everyone to use and improve.
## Key Features ✨
- Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
- Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API (see the request sketch after this list).
- Supports vLLM and LLaMA-Factory: Fine-tune easily on a single GPU or deploy with vLLM’s fast inference server.
- Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
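Because the container speaks the OpenAI-style chat format, any OpenAI-compatible client can call it. Below is a minimal sketch, assuming an OpenAI-compatible endpoint (for example, one exposed by a local vLLM server); the base URL, API key, and image URL are placeholders, not values defined by this project.

```python
# Minimal sketch of the OpenAI-style chat message format with an image input.
# Assumes an OpenAI-compatible endpoint (e.g. a local vLLM server); the base_url,
# api_key, and image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Instruct",
    temperature=0.2,  # recommended for the Instruct variant (see the table above)
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},
                {"type": "text", "text": "Describe this image in one paragraph."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For the Thinking variant, swap in its model name and raise the temperature to 0.6, per the table above.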
## How the Container Works 🛠️
- Core Weights: Uses the Moonlight-16B-A3B language backbone and the MoonViT image encoder directly from Hugging Face.
- Dependencies: Built with PyTorch 2.5, `transformers` 4.51, `torchvision`, `flash-attn`, and optionally `vllm`.
- Processing Steps (a minimal code sketch follows this list):
  - Loads images using Pillow.
  - Formats inputs through `AutoProcessor`.
  - Executes model inference directly or via vLLM.
  - Decodes output tokens into human-readable text.
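Those four steps can be reproduced outside the container with plain `transformers`. The sketch below follows the general pattern from the Hugging Face model card (loading with `trust_remote_code=True` and formatting via the processor’s chat template); the image path, prompt, and generation settings are placeholders, so treat it as a starting point rather than the container’s exact code.

```python
# Minimal sketch of the processing steps above: load image -> format -> generate -> decode.
# Assumes the Hugging Face weights and a GPU with bfloat16 support; paths and prompts
# are placeholders.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# 1. Load the image with Pillow.
image = Image.open("demo.png")

# 2. Format the chat-style input through AutoProcessor.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# 3. Run inference.
generated = model.generate(**inputs, max_new_tokens=512)

# 4. Decode only the newly generated tokens into human-readable text.
new_tokens = generated[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```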
## Use Cases 💡
- Summarizing or analyzing multi-page documents and large PDFs.
- Solving complex math problems using visual data.
- Building AI agents capable of reasoning across images, text, and video.
## Limitations ⚠️
- GPU Memory: At least 24 GB of VRAM is recommended; use Replicate’s larger GPU instances for full 128K-context tasks.
- No Direct Video Input: Videos must be pre-processed into frames before use (see the frame-sampling sketch after this list).
- Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
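Because video has to arrive as frames, a small pre-processing step is needed before calling the model. The sketch below uses OpenCV (not a dependency of this container) to sample frames; the file name, sampling interval, and frame cap are arbitrary illustration choices.

```python
# Minimal sketch (not part of the container): sample frames from a video with OpenCV
# so they can be passed to the model as images. The interval and cap are arbitrary.
import cv2
from PIL import Image

def sample_frames(video_path: str, every_n: int = 30, max_frames: int = 8) -> list[Image.Image]:
    """Grab every `every_n`-th frame, capped at `max_frames` (the per-prompt image limit)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:  # end of video or read error
            break
        if index % every_n == 0:
            # OpenCV decodes to BGR; convert to RGB before handing frames to Pillow.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        index += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # pass these frames as the image inputs
```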
## License and Citation 📜
Everything is openly available under the MIT license.
    @misc{kimiteam2025kimivl,
      title         = {Kimi-VL Technical Report},
      author        = {Kimi Team and Angang Du and others},
      year          = {2025},
      eprint        = {2504.07491},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url           = {https://arxiv.org/abs/2504.07491}
    }
## Acknowledgements
Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.
⭐ Star this project on GitHub: moonshotai/Kimi-VL
🐦 Follow me on Twitter: @zsakib_