zsxkib / kimi-vl-a3b-thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text with an explicit step-by-step reasoning ("thinking") trace

  • Public
  • 115 runs
  • GitHub
  • Weights
  • Paper
  • License

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉

Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, understanding long contexts, and powering capable AI agents—while only activating 2.8 billion parameters at runtime.

There are two flavors of Kimi-VL:

| Variant | Best For | Recommended Temperature |
| --- | --- | --- |
| Kimi-VL-A3B-Instruct | General perception, OCR, videos, agents | 0.2 |
| Kimi-VL-A3B-Thinking | Complex reasoning, math problems, puzzles | 0.6 |
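
When calling this model on Replicate, pass the recommended temperature for the variant you are using. The snippet below is a minimal sketch with the Replicate Python client; the input field names ("prompt", "image", "temperature") are assumptions, so check the model's API schema on Replicate for the exact parameters.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# NOTE: the input field names below are assumptions -- consult the model's
# API schema on Replicate for the exact parameter names.
import replicate

output = replicate.run(
    "zsxkib/kimi-vl-a3b-thinking",  # may also need a ":<version>" suffix
    input={
        "prompt": "What is the area of the triangle shown in this figure?",
        "image": open("triangle.png", "rb"),
        "temperature": 0.6,  # recommended for the Thinking variant
    },
)
print(output)
```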

Why Kimi-VL Is Useful

  • Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
  • Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
  • Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
  • Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
  • Fully Open-Source: Licensed entirely under MIT, making it easy and flexible for everyone to use and improve.

Key Features ✨

  • Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
  • Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API (see the sketch after this list).
  • Supports vLLM and LLaMA-Factory: Fine-tune easily with LLaMA-Factory on a single GPU, or deploy with vLLM’s fast inference server.
  • Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
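
As a sketch of that OpenAI-compatible interface, the snippet below queries a vLLM server that is assumed to be running locally and serving the Kimi-VL weights (for example via `vllm serve moonshotai/Kimi-VL-A3B-Thinking --trust-remote-code`); the port, model name, and image URL are placeholders for your own setup, not fixed values of this container.

```python
# Hedged sketch: talk to a locally running vLLM server through its
# OpenAI-compatible chat API. Endpoint, model name, and image URL are
# assumptions about your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```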

How the Container Works 🛠️

  • Core Weights: Uses the Moonlight-16B-A3B language backbone and the MoonViT image encoder, loaded directly from Hugging Face.
  • Dependencies: Built with PyTorch 2.5, transformers 4.51, torchvision, flash-attn, and optionally vllm.
  • Processing Steps (see the sketch after this list):
      • Loads images using Pillow.
      • Formats inputs through AutoProcessor.
      • Executes model inference directly or via vLLM.
      • Decodes output tokens into human-readable text.
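
The snippet below is a minimal sketch of those four steps using the Hugging Face weights (moonshotai/Kimi-VL-A3B-Thinking) and the transformers API; it mirrors the documented usage pattern rather than the container's exact code, and the prompt, file name, and generation settings are illustrative.

```python
# Minimal sketch of the four processing steps above, assuming the Hugging
# Face weights "moonshotai/Kimi-VL-A3B-Thinking". Generation settings are
# illustrative, not the container's exact configuration.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# 1. Load the image with Pillow.
image = Image.open("figure.png")

# 2. Format inputs through AutoProcessor using the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "figure.png"},
        {"type": "text", "text": "Solve the problem in this figure step by step."},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# 3. Execute model inference.
generated = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)

# 4. Decode only the newly generated tokens into human-readable text.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```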

Use Cases 💡

  • Summarizing or analyzing multi-page documents and large PDFs.
  • Solving complex math problems using visual data.
  • Building AI agents capable of reasoning across images, text, and video.

Limitations ⚠️

  • GPU Memory: Works best with at least 24GB of VRAM; use Replicate’s larger GPU instances for full 128K-context tasks.
  • No Direct Video Input: Videos must be pre-processed into individual frames before they are passed to the model (see the sketch after this list).
  • Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
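
A hedged sketch of that external pre-processing step: evenly sample a few frames from a clip with OpenCV and save them as images to pass in as a multi-image prompt. The frame count, file names, and sampling strategy are illustrative choices, not requirements of the model.

```python
# Illustrative video pre-processing: evenly sample frames with OpenCV and
# write them to disk as PNGs. Frame count and file names are arbitrary.
import cv2

def sample_frames(video_path: str, num_frames: int = 8) -> list[str]:
    """Evenly sample up to `num_frames` frames and save them as PNG files."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    paths = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        path = f"frame_{i:02d}.png"
        cv2.imwrite(path, frame)
        paths.append(path)
    cap.release()
    return paths

frame_paths = sample_frames("clip.mp4")  # stays within the 8-image prompt limit
```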

License and Citation 📜

Everything is openly available under the MIT license.

@misc{kimiteam2025kimivl,
  title         = {Kimi-VL Technical Report},
  author        = {Kimi Team and Angang Du et al.},
  year          = {2025},
  eprint        = {2504.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2504.07491}
}

Acknowledgements

Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.


Star this project on GitHub: moonshotai/Kimi-VL

🐦 Follow me on Twitter: @zsakib_