zsxkib / kimi-vl-a3b-thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text with an explicit thinking (reasoning) process

  • Public
  • 20 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

  • Prompt (string, required): text prompt for the model
  • Image (file, optional): optional image input
  • Top-p (number, minimum: 0, maximum: 1): top-p sampling probability. Default: 1
  • Temperature (number, minimum: 0, maximum: 2): sampling temperature. Default: 0.6
  • Max tokens (integer, minimum: 1): maximum number of tokens to generate. Default: 2048
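
As a rough usage sketch, the inputs above can be supplied from Python with Replicate's client library. The input key names used below (prompt, image, top_p, temperature, max_new_tokens) are assumptions inferred from the field descriptions, so check the model's API schema before relying on them.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# The input keys are assumed names inferred from the form above, not confirmed
# by this page; consult the model's API schema for the exact names.
import replicate

output = replicate.run(
    "zsxkib/kimi-vl-a3b-thinking",  # recent clients resolve this to the latest version
    input={
        "prompt": "Where am I?",                # text prompt for the model
        "image": open("screenshot.png", "rb"),  # optional image input
        "top_p": 1.0,                           # top-p sampling probability (0-1)
        "temperature": 0.6,                     # recommended for the Thinking variant
        "max_new_tokens": 2048,                 # maximum number of tokens to generate
    },
)
# Depending on the model, output may be a plain string or an iterator of chunks.
print(output if isinstance(output, str) else "".join(output))
```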

Output

◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game. First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷You are in **Cyberpunk 2077**, a role-playing video game set in a dystopian future version of Night City. The image shows you inside **The Heist**, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉

Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, understanding long contexts, and powering capable AI agents—while only activating 2.8 billion parameters at runtime.

There are two flavors of Kimi-VL:

Variant                 Best For                                    Recommended Temperature
Kimi-VL-A3B-Instruct    General perception, OCR, videos, agents     0.2
Kimi-VL-A3B-Thinking    Complex reasoning, math problems, puzzles   0.6

Why Kimi-VL Is Useful

  • Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
  • Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
  • Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
  • Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
  • Fully Open-Source: Licensed entirely under MIT, making it easy and flexible for everyone to use and improve.

Key Features ✨

  • Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
  • Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API.
  • Supports vLLM and LLaMA-Factory: Fine-tune easily on a single GPU or deploy with vLLM’s fast inference server (a serving sketch follows this list).
  • Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
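
For the chat-style interface and vLLM support mentioned above, here is a minimal serving-and-querying sketch. It assumes vLLM's OpenAI-compatible server supports this checkpoint and that the shown flags match your vLLM version; treat it as illustrative rather than the container's exact setup.

```python
# Sketch: query Kimi-VL behind vLLM's OpenAI-compatible API.
# Assumes the server was started separately, for example:
#   vllm serve moonshotai/Kimi-VL-A3B-Thinking --trust-remote-code
# (flags and model support can vary across vLLM versions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.png"}},
                {"type": "text", "text": "Where am I?"},
            ],
        }
    ],
    temperature=0.6,  # recommended for the Thinking variant
    max_tokens=2048,
)
print(response.choices[0].message.content)
```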

How the Container Works 🛠️

  • Core Weights: Uses Moonlight-16B-A3B backbone and MoonViT image encoder directly from Hugging Face.
  • Dependencies: Built with PyTorch 2.5, transformers 4.51, torchvision, flash-attn, and optionally vllm.
  • Processing Steps (a minimal inference sketch follows this list):
      1. Loads images using Pillow.
      2. Formats inputs through AutoProcessor.
      3. Executes model inference directly or via vLLM.
      4. Decodes output tokens into human-readable text.
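
The steps above correspond roughly to the following sketch with Hugging Face transformers and Pillow. The exact message format and processor arguments for this checkpoint are assumptions based on the standard transformers multimodal API (trust_remote_code is typically needed for custom architectures); treat it as illustrative rather than the container's actual code.

```python
# Illustrative sketch of the four processing steps; argument names for this
# checkpoint's processor and chat template are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# 1. Load the image with Pillow.
image = Image.open("screenshot.png")

# 2. Format inputs through AutoProcessor via the chat template.
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Where am I?"}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# 3. Run model inference.
output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)

# 4. Decode the newly generated tokens into human-readable text.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```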

Use Cases 💡

  • Summarizing or analyzing multi-page documents and large PDFs.
  • Solving complex math problems using visual data.
  • Building AI agents capable of reasoning across images, text, and video.

Limitations ⚠️

  • GPU Memory: Ideally requires at least 24GB VRAM. Use Replicate’s larger GPU instances for full 128K context tasks.
  • No Direct Video Input: Requires pre-processing of videos into frames externally before use (a frame-extraction sketch follows this list).
  • Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
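
Because video is not accepted directly, one possible pre-processing approach (not part of this container) is to sample a handful of evenly spaced frames with OpenCV and pass them as individual image inputs, staying within the per-prompt image limit. The helper below is hypothetical.

```python
# Hypothetical helper: sample up to `max_frames` evenly spaced frames from a
# video with OpenCV so they can be sent to the model as individual images.
import cv2

def sample_frames(video_path: str, max_frames: int = 8) -> list[str]:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frame_paths = []
    for index in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            break
        path = f"frame_{index:06d}.png"
        cv2.imwrite(path, frame)
        frame_paths.append(path)
        if len(frame_paths) >= max_frames:
            break
    cap.release()
    return frame_paths
```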

License and Citation 📜

Everything is openly available under the MIT license.

@misc{kimiteam2025kimivl,
  title         = {Kimi-VL Technical Report},
  author        = {Kimi Team and Angang Du et al.},
  year          = {2025},
  eprint        = {2504.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2504.07491}
}

Acknowledgements

Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.


Star this project on GitHub: moonshotai/Kimi-VL

🐦 Follow me on Twitter: @zsakib_