zsxkib / kimi-vl-a3b-thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text with an explicit thinking (reasoning) process

  • Public
  • 20 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

  • Prompt (string, required): text prompt for the model
  • Image (file, optional): optional image input
  • Top-p (number, minimum: 0, maximum: 1): top-p sampling probability. Default: 1
  • Temperature (number, minimum: 0, maximum: 2): sampling temperature. Default: 0.6
  • Max tokens (integer, minimum: 1): maximum number of tokens to generate. Default: 2048
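
As a rough usage sketch, the inputs above can be supplied from Python with Replicate's client library. The input key names used below (prompt, image, top_p, temperature, max_new_tokens) are assumptions inferred from the field descriptions, so check the model's API schema before relying on them.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# The input keys are assumed names inferred from the form above, not confirmed
# by this page; consult the model's API schema for the exact names.
import replicate

output = replicate.run(
    "zsxkib/kimi-vl-a3b-thinking",  # recent clients resolve this to the latest version
    input={
        "prompt": "Where am I?",                # text prompt for the model
        "image": open("screenshot.png", "rb"),  # optional image input
        "top_p": 1.0,                           # top-p sampling probability (0-1)
        "temperature": 0.6,                     # recommended for the Thinking variant
        "max_new_tokens": 2048,                 # maximum number of tokens to generate
    },
)
# Depending on the model, output may be a plain string or an iterator of chunks.
print(output if isinstance(output, str) else "".join(output))
```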

Output

◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game. First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷You are in **Cyberpunk 2077**, a role-playing video game set in a dystopian future version of Night City. The image shows you inside **The Heist**, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉

Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, understanding long contexts, and powering capable AI agents—while only activating 2.8 billion parameters at runtime.

There are two flavors of Kimi-VL:

Variant                 Best For                                    Recommended Temperature
Kimi-VL-A3B-Instruct    General perception, OCR, videos, agents     0.2
Kimi-VL-A3B-Thinking    Complex reasoning, math problems, puzzles   0.6

Why Kimi-VL Is Useful

  • Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
  • Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
  • Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
  • Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
  • Fully Open-Source: Licensed entirely under MIT, making it easy and flexible for everyone to use and improve.

Key Features ✨

  • Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
  • Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API.
  • Supports vLLM and LLaMA-Factory: Fine-tune easily on a single GPU or deploy with vLLM’s fast inference server (a serving sketch follows this list).
  • Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
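
For the chat-style interface and vLLM support mentioned above, here is a minimal serving-and-querying sketch. It assumes vLLM's OpenAI-compatible server supports this checkpoint and that the shown flags match your vLLM version; treat it as illustrative rather than the container's exact setup.

```python
# Sketch: query Kimi-VL behind vLLM's OpenAI-compatible API.
# Assumes the server was started separately, for example:
#   vllm serve moonshotai/Kimi-VL-A3B-Thinking --trust-remote-code
# (flags and model support can vary across vLLM versions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.png"}},
                {"type": "text", "text": "Where am I?"},
            ],
        }
    ],
    temperature=0.6,  # recommended for the Thinking variant
    max_tokens=2048,
)
print(response.choices[0].message.content)
```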

How the Container Works 🛠️

  • Core Weights: Uses Moonlight-16B-A3B backbone and MoonViT image encoder directly from Hugging Face.
  • Dependencies: Built with PyTorch 2.5, transformers 4.51, torchvision, flash-attn, and optionally vllm.
  • Processing Steps (a minimal inference sketch follows this list):
      1. Loads images using Pillow.
      2. Formats inputs through AutoProcessor.
      3. Executes model inference directly or via vLLM.
      4. Decodes output tokens into human-readable text.
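
The steps above correspond roughly to the following sketch with Hugging Face transformers and Pillow. The exact message format and processor arguments for this checkpoint are assumptions based on the standard transformers multimodal API (trust_remote_code is typically needed for custom architectures); treat it as illustrative rather than the container's actual code.

```python
# Illustrative sketch of the four processing steps; argument names for this
# checkpoint's processor and chat template are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# 1. Load the image with Pillow.
image = Image.open("screenshot.png")

# 2. Format inputs through AutoProcessor via the chat template.
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Where am I?"}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# 3. Run model inference.
output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)

# 4. Decode the newly generated tokens into human-readable text.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```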

Use Cases 💡

  • Summarizing or analyzing multi-page documents and large PDFs.
  • Solving complex math problems using visual data.
  • Building AI agents capable of reasoning across images, text, and video.

Limitations ⚠️

  • GPU Memory: Ideally requires at least 24GB VRAM. Use Replicate’s larger GPU instances for full 128K context tasks.
  • No Direct Video Input: Requires pre-processing of videos into frames externally before use (a frame-extraction sketch follows this list).
  • Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
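
Because video is not accepted directly, one possible pre-processing approach (not part of this container) is to sample a handful of evenly spaced frames with OpenCV and pass them as individual image inputs, staying within the per-prompt image limit. The helper below is hypothetical.

```python
# Hypothetical helper: sample up to `max_frames` evenly spaced frames from a
# video with OpenCV so they can be sent to the model as individual images.
import cv2

def sample_frames(video_path: str, max_frames: int = 8) -> list[str]:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frame_paths = []
    for index in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            break
        path = f"frame_{index:06d}.png"
        cv2.imwrite(path, frame)
        frame_paths.append(path)
        if len(frame_paths) >= max_frames:
            break
    cap.release()
    return frame_paths
```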

License and Citation 📜

Everything is openly available under the MIT license.

@misc{kimiteam2025kimivl,
  title         = {Kimi-VL Technical Report},
  author        = {Kimi Team and Angang Du et al.},
  year          = {2025},
  eprint        = {2504.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2504.07491}
}

Acknowledgements

Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.


Star this project on GitHub: moonshotai/Kimi-VL

🐦 Follow me on Twitter: @zsakib_