zsxkib / kimi-audio-7b-instruct

🎧 Kimi-Audio-7B-Instruct unifies ASR, audio reasoning, captioning, emotion sensing, and TTS in one universal model 🔊

  • Public
  • 490 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

file (required)

Input audio file for processing. Can be used for speech-to-text (ASR) or audio-to-audio generation.

string

Optional text prompt to guide the model. For ASR, use prompts like 'Please convert this audio to text' or '请将音频内容转换为文字' (Chinese).

string

Type of output to generate: 'audio' for audio only, 'text' for transcription only, or 'both' for both audio and text responses.

Default: "both"

boolean

Return text results in JSON format instead of text file

Default: true

number

Temperature for audio generation. Higher values (0.8-1.0) increase creativity but may reduce coherence.

Default: 0.8

integer

Top-k for audio generation. Limits the token selection to the k most likely tokens.

Default: 10

number

Temperature for text generation. Lower values (0.0-0.5) increase factual accuracy.

Default: 0

integer

Top-k for text generation. Limits the token selection to the k most likely tokens.

Default: 5

number

Repetition penalty for audio. Values > 1.0 discourage repetition in audio generation.

Default: 1

integer

Window size for audio repetition penalty calculation.

Default: 64

number

Repetition penalty for text. Values > 1.0 discourage repetition in text generation.

Default: 1

integer

Window size for text repetition penalty calculation.

Default: 16
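
To see how these inputs fit together, here is a minimal sketch using the Replicate Python client. The audio parameter is named `file` per the schema above; the remaining input keys (`prompt`, `output_type`, `text_temperature`, `text_top_k`) are illustrative guesses, since the schema lists types and descriptions but not field names, so check the model's API tab for the exact keys.

```python
# Minimal sketch of calling the model with the Replicate Python client.
# NOTE: apart from "file", the input keys below are hypothetical; confirm them
# against the model's API tab before using them.
import replicate

output = replicate.run(
    "zsxkib/kimi-audio-7b-instruct",
    input={
        "file": open("question.wav", "rb"),    # local file, or a public URL string
        "prompt": "Please convert this audio to text",
        "output_type": "text",                 # "audio", "text", or "both"
        "text_temperature": 0.0,
        "text_top_k": 5,
    },
)
print(output)
```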

Output

json_str

I'm just an AI, I don't have feelings or experiences, so I don't have good or bad days. But I'm always here to help you with your questions and tasks!

media_path


Run time and cost

This model costs approximately $0.0012 to run on Replicate, or 833 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 2 seconds.

Readme

Kimi-Audio-7B-Instruct: universal audio model 🔊 (Cog implementation)

Replicate

This Replicate model runs Kimi-Audio-7B-Instruct, Moonshot AI’s open-source, 7-billion-parameter audio model. It listens to any sound, understands what’s happening, and can answer in text or speech. The same checkpoint handles speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face, [2504.18425] Kimi-Audio Technical Report - arXiv.org)

GitHub: https://github.com/MoonshotAI/Kimi-Audio (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
Technical report: arXiv 2504.18425 ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
Hugging Face weights: moonshotai/Kimi-Audio-7B-Instruct (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)


About the model

Kimi-Audio turns raw audio into continuous acoustic features and discrete semantic tokens, then feeds both to a Qwen 2.5 transformer with parallel heads for text and audio generation. A flow-matching vocoder streams 24 kHz speech with about 300 milliseconds of latency. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
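
As a mental model of that flow, here is a small structural sketch. Every component below is a placeholder callable; the names and signatures are illustrative, not the real kimia_infer API.

```python
# Structural sketch of the pipeline described above. All components are
# placeholder callables; names and signatures are illustrative, not the real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class KimiAudioPipeline:
    acoustic_encoder: Callable    # waveform -> continuous acoustic features
    semantic_tokenizer: Callable  # waveform -> discrete semantic tokens
    audio_llm: Callable           # (features, tokens, prompt ids) -> hidden states
    text_head: Callable           # hidden states -> text tokens
    audio_head: Callable          # hidden states -> discrete audio tokens
    vocoder: Callable             # audio tokens -> 24 kHz waveform (flow matching)

    def run(self, waveform, prompt_ids):
        feats = self.acoustic_encoder(waveform)          # continuous stream
        sem = self.semantic_tokenizer(waveform)          # discrete stream
        hidden = self.audio_llm(feats, sem, prompt_ids)  # shared Qwen 2.5-based LLM
        text = self.text_head(hidden)                    # parallel head 1: text
        speech = self.vocoder(self.audio_head(hidden))   # parallel head 2: 24 kHz audio
        return text, speech
```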

The model was pre-trained on more than thirteen million hours of speech, music, and everyday sounds, then fine-tuned with conversation data so it follows chat prompts. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, How to Install Kimi-Audio 7B Instruct Locally - DEV Community)


Key features


Replicate packaging ⚙️

| Component | Source |
|---|---|
| Transformer weights | moonshotai/Kimi-Audio-7B-Instruct (≈ 9.8 GB, bf16) (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face) |
| Tokenizers + vocoder | Bundled in the GitHub repo (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …) |
| Docker base image | moonshotai/kimi-audio:v0.1 (Kimi-Audio/Dockerfile at master · MoonshotAI/Kimi-Audio - GitHub) |

The Cog container caches weights under /model_cache and sets HF_HOME and TORCH_HOME so the files are reused across runs.
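
A sketch of the kind of setup code that does this (the exact lines in predict.py may differ):

```python
# Pin all framework caches to one persistent directory so weights downloaded on
# the first run are reused by later runs.
import os

MODEL_CACHE = "/model_cache"
os.makedirs(MODEL_CACHE, exist_ok=True)
os.environ["HF_HOME"] = MODEL_CACHE     # Hugging Face hub / transformers cache
os.environ["TORCH_HOME"] = MODEL_CACHE  # torch hub checkpoint cache
```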

predict.py flow

  1. Load weights, tokenizers, and detokenizer onto the GPU.
  2. Accept an audio file or URL plus an optional text prompt; temperatures, top-k values, and the random seed are all tweakable.
  3. Generate text only, audio only, or both, using the model’s generate method.
  4. Return a JSON payload with a path to the WAV file (if speech was requested) and the generated text.
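
A simplified sketch of that flow as a Cog predictor follows. The kimia_infer import path, the message format, and the exact generate() signature are assumptions based on the upstream repo and may differ from the actual predict.py.

```python
# Simplified sketch of the predict.py flow above; the kimia_infer import path and
# the generate() signature are assumptions, not verbatim from this repo.
from typing import Optional
import soundfile as sf
from cog import BaseModel, BasePredictor, Input, Path

class Output(BaseModel):
    text: Optional[str] = None
    audio: Optional[Path] = None

class Predictor(BasePredictor):
    def setup(self):
        # 1. Load weights, tokenizers, and the detokenizer onto the GPU once.
        from kimia_infer.api.kimia import KimiAudio  # assumed import path
        self.model = KimiAudio(model_path="/model_cache/Kimi-Audio-7B-Instruct",
                               load_detokenizer=True)

    def predict(
        self,
        file: Path = Input(description="Input audio file"),
        prompt: str = Input(description="Optional text prompt", default=""),
        output_type: str = Input(choices=["text", "audio", "both"], default="both"),
        text_temperature: float = Input(default=0.0),
        audio_temperature: float = Input(default=0.8),
    ) -> Output:
        # 2. Assemble a chat-style message list from the prompt and the audio file.
        messages = []
        if prompt:
            messages.append({"role": "user", "message_type": "text", "content": prompt})
        messages.append({"role": "user", "message_type": "audio", "content": str(file)})

        # 3. Generate text only, audio only, or both with the model's generate method.
        wav, text = self.model.generate(messages, output_type=output_type,
                                        text_temperature=text_temperature,
                                        audio_temperature=audio_temperature)

        # 4. Write the 24 kHz waveform (if any) and return it alongside the text.
        audio_path = None
        if wav is not None:
            audio_path = Path("/tmp/output.wav")
            sf.write(str(audio_path), wav.detach().cpu().view(-1).numpy(), 24000)
        return Output(text=text, audio=audio_path)
```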

Expect about 24 GB of GPU memory for full-precision weights, or roughly 8 GB with 4-bit quantization at slower speed. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
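
A back-of-envelope check of those numbers, treating the language model as roughly 7 billion parameters (the headroom above the raw weight size covers the audio encoder, vocoder, activations, and KV cache):

```python
# Rough weight-memory estimate; the totals quoted above also include activations,
# KV cache, and the non-LLM components (audio tokenizer, encoder, vocoder).
params = 7e9
print(f"bf16  (2 bytes/param):   {params * 2 / 1e9:.1f} GB")   # ~14 GB of weights
print(f"4-bit (0.5 bytes/param): {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB of weights
```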


How it works under the hood


Use cases

  • Build a voice assistant that actually answers instead of reading web snippets.
  • Add live captions and emotion tags to video calls.
  • Monitor factory sounds for unusual events.
  • Turn recorded meetings into searchable text.

Limitations


License and disclaimer

Kimi-Audio-7B-Instruct weights are MIT. Code that originated in Qwen 2.5 is Apache 2.0. (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
You are responsible for any content you generate with this model. Follow local laws and the upstream license terms.


Citation

@misc{kimi_audio_2025,
  title        = {Kimi-Audio Technical Report},
  author       = {Kimi Team},
  year         = {2025},
  eprint       = {2504.18425},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Cog implementation managed by zsxkib.

Star the repo on GitHub once it’s live. ⭐

Follow me on Twitter/X