zsxkib / kimi-audio-7b-instruct

🎧 Kimi-Audio-7B-Instruct unifies ASR, audio reasoning, captioning, emotion sensing, and TTS in one universal model 🔊

  • Public
  • 490 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

file (required)

Input audio file for processing. Can be used for speech-to-text (ASR) or audio-to-audio generation.

string

Optional text prompt to guide the model. For ASR, use prompts like 'Please convert this audio to text' or '请将音频内容转换为文字' (Chinese).

string

Type of output to generate: 'audio' for audio only, 'text' for transcription only, or 'both' for both audio and text responses.

Default: "both"

boolean

Return text results in JSON format instead of text file

Default: true

number

Temperature for audio generation. Higher values (0.8-1.0) increase creativity but may reduce coherence.

Default: 0.8

integer

Top-k for audio generation. Limits the token selection to the k most likely tokens.

Default: 10

number

Temperature for text generation. Lower values (0.0-0.5) increase factual accuracy.

Default: 0

integer

Top-k for text generation. Limits the token selection to the k most likely tokens.

Default: 5

number

Repetition penalty for audio. Values > 1.0 discourage repetition in audio generation.

Default: 1

integer

Window size for audio repetition penalty calculation.

Default: 64

number

Repetition penalty for text. Values > 1.0 discourage repetition in text generation.

Default: 1

integer

Window size for text repetition penalty calculation.

Default: 16
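
To see how these inputs fit together, here is a minimal sketch using the Replicate Python client. The audio parameter is named `file` per the schema above; the remaining input keys (`prompt`, `output_type`, `text_temperature`, `text_top_k`) are illustrative guesses, since the schema lists types and descriptions but not field names, so check the model's API tab for the exact keys.

```python
# Minimal sketch of calling the model with the Replicate Python client.
# NOTE: apart from "file", the input keys below are hypothetical; confirm them
# against the model's API tab before using them.
import replicate

output = replicate.run(
    "zsxkib/kimi-audio-7b-instruct",
    input={
        "file": open("question.wav", "rb"),    # local file, or a public URL string
        "prompt": "Please convert this audio to text",
        "output_type": "text",                 # "audio", "text", or "both"
        "text_temperature": 0.0,
        "text_top_k": 5,
    },
)
print(output)
```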

Output

json_str

I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!

media_path


Run time and cost

This model costs approximately $0.0012 to run on Replicate, or 833 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 2 seconds.

Readme

Kimi-Audio-7B-Instruct: universal audio model 🔊 (Cog implementation)

Replicate

This Replicate model runs Kimi-Audio-7B-Instruct, Moonshot AI’s open-source, 7-billion-parameter audio model. It listens to any sound, understands what’s happening, and can answer in text or speech. The same checkpoint handles speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face, [2504.18425] Kimi-Audio Technical Report - arXiv.org)

GitHub: https://github.com/MoonshotAI/Kimi-Audio (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
Technical report: arXiv 2504.18425 ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
Hugging Face weights: moonshotai/Kimi-Audio-7B-Instruct (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)


About the model

Kimi-Audio turns raw audio into continuous acoustic features and discrete semantic tokens, then feeds both to a Qwen 2.5 transformer with parallel heads for text and audio generation. A flow-matching vocoder streams 24 kHz speech with about 300 milliseconds of latency. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
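
As a mental model of that flow, here is a small structural sketch. Every component below is a placeholder callable; the names and signatures are illustrative, not the real kimia_infer API.

```python
# Structural sketch of the pipeline described above. All components are
# placeholder callables; names and signatures are illustrative, not the real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class KimiAudioPipeline:
    acoustic_encoder: Callable    # waveform -> continuous acoustic features
    semantic_tokenizer: Callable  # waveform -> discrete semantic tokens
    audio_llm: Callable           # (features, tokens, prompt ids) -> hidden states
    text_head: Callable           # hidden states -> text tokens
    audio_head: Callable          # hidden states -> discrete audio tokens
    vocoder: Callable             # audio tokens -> 24 kHz waveform (flow matching)

    def run(self, waveform, prompt_ids):
        feats = self.acoustic_encoder(waveform)          # continuous stream
        sem = self.semantic_tokenizer(waveform)          # discrete stream
        hidden = self.audio_llm(feats, sem, prompt_ids)  # shared Qwen 2.5-based LLM
        text = self.text_head(hidden)                    # parallel head 1: text
        speech = self.vocoder(self.audio_head(hidden))   # parallel head 2: 24 kHz audio
        return text, speech
```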

The model was pre-trained on more than thirteen million hours of speech, music, and everyday sounds, then fine-tuned with conversation data so it follows chat prompts. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, How to Install Kimi-Audio 7B Instruct Locally - DEV Community)


Key features


Replicate packaging ⚙️

| Component | Source |
|---|---|
| Transformer weights | moonshotai/Kimi-Audio-7B-Instruct (≈ 9.8 GB, bf16) (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face) |
| Tokenizers + vocoder | Bundled in the GitHub repo (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …) |
| Docker base image | moonshotai/kimi-audio:v0.1 (Kimi-Audio/Dockerfile at master · MoonshotAI/Kimi-Audio - GitHub) |

The Cog container caches weights under /model_cache and sets HF_HOME and TORCH_HOME so the files are reused across runs.
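
A sketch of the kind of setup code that does this (the exact lines in predict.py may differ):

```python
# Pin all framework caches to one persistent directory so weights downloaded on
# the first run are reused by later runs.
import os

MODEL_CACHE = "/model_cache"
os.makedirs(MODEL_CACHE, exist_ok=True)
os.environ["HF_HOME"] = MODEL_CACHE     # Hugging Face hub / transformers cache
os.environ["TORCH_HOME"] = MODEL_CACHE  # torch hub checkpoint cache
```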

predict.py flow

  1. Load weights, tokenizers, and detokenizer onto the GPU.
  2. Accept an audio file or URL plus an optional text prompt; temperatures, top-k values, and the random seed are all tweakable.
  3. Generate text only, audio only, or both, using the model’s generate method.
  4. Return a JSON payload with a path to the WAV file (if speech was requested) and the generated text.
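
A simplified sketch of that flow as a Cog predictor follows. The kimia_infer import path, the message format, and the exact generate() signature are assumptions based on the upstream repo and may differ from the actual predict.py.

```python
# Simplified sketch of the predict.py flow above; the kimia_infer import path and
# the generate() signature are assumptions, not verbatim from this repo.
from typing import Optional
import soundfile as sf
from cog import BaseModel, BasePredictor, Input, Path

class Output(BaseModel):
    text: Optional[str] = None
    audio: Optional[Path] = None

class Predictor(BasePredictor):
    def setup(self):
        # 1. Load weights, tokenizers, and the detokenizer onto the GPU once.
        from kimia_infer.api.kimia import KimiAudio  # assumed import path
        self.model = KimiAudio(model_path="/model_cache/Kimi-Audio-7B-Instruct",
                               load_detokenizer=True)

    def predict(
        self,
        file: Path = Input(description="Input audio file"),
        prompt: str = Input(description="Optional text prompt", default=""),
        output_type: str = Input(choices=["text", "audio", "both"], default="both"),
        text_temperature: float = Input(default=0.0),
        audio_temperature: float = Input(default=0.8),
    ) -> Output:
        # 2. Assemble a chat-style message list from the prompt and the audio file.
        messages = []
        if prompt:
            messages.append({"role": "user", "message_type": "text", "content": prompt})
        messages.append({"role": "user", "message_type": "audio", "content": str(file)})

        # 3. Generate text only, audio only, or both with the model's generate method.
        wav, text = self.model.generate(messages, output_type=output_type,
                                        text_temperature=text_temperature,
                                        audio_temperature=audio_temperature)

        # 4. Write the 24 kHz waveform (if any) and return it alongside the text.
        audio_path = None
        if wav is not None:
            audio_path = Path("/tmp/output.wav")
            sf.write(str(audio_path), wav.detach().cpu().view(-1).numpy(), 24000)
        return Output(text=text, audio=audio_path)
```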

Expect about 24 GB of GPU memory for full-precision weights, or roughly 8 GB with 4-bit quantization at slower speed. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
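
A back-of-envelope check of those numbers, treating the language model as roughly 7 billion parameters (the headroom above the raw weight size covers the audio encoder, vocoder, activations, and KV cache):

```python
# Rough weight-memory estimate; the totals quoted above also include activations,
# KV cache, and the non-LLM components (audio tokenizer, encoder, vocoder).
params = 7e9
print(f"bf16  (2 bytes/param):   {params * 2 / 1e9:.1f} GB")   # ~14 GB of weights
print(f"4-bit (0.5 bytes/param): {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB of weights
```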


How it works under the hood


Use cases

  • Build a voice assistant that actually answers instead of reading web snippets.
  • Add live captions and emotion tags to video calls.
  • Monitor factory sounds for unusual events.
  • Turn recorded meetings into searchable text.

Limitations


License and disclaimer

Kimi-Audio-7B-Instruct weights are MIT. Code that originated in Qwen 2.5 is Apache 2.0. (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
You are responsible for any content you generate with this model. Follow local laws and the upstream license terms.


Citation

@misc{kimi_audio_2025,
  title        = {Kimi-Audio Technical Report},
  author       = {Kimi Team},
  year         = {2025},
  eprint       = {2504.18425},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Cog implementation managed by zsxkib.

Star the repo on GitHub once it’s live. ⭐

Follow me on Twitter/X