Kokoro: A Frontier TTS Model

Note

Kokoro v0.19 can output a max of 30 seconds of audio per generation.
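
Because of this cap, longer passages have to be split before synthesis. The following is a minimal sketch of one way to do that, not part of Kokoro itself: `split_for_tts` is a hypothetical helper, and the word budget assumes a speaking rate of roughly 2.5 words per second with 20% headroom, which is an illustrative guess rather than a documented Kokoro figure. Each returned chunk would then be passed to the model as a separate generation.

```python
import re

# Assumption: 30 s cap * ~2.5 words/s * 0.8 headroom = 60 words per chunk.
# The speaking rate and headroom are illustrative, not Kokoro figures.
MAX_WORDS = 60


def split_for_tts(text, max_words=MAX_WORDS):
    """Split text into chunks of at most max_words words.

    Chunks break at sentence boundaries where possible; a single
    sentence longer than the budget is cut on word boundaries.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Oversized sentence: flush what we have, then emit full-size slices.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Counting words is only a proxy for duration; if generated clips still approach 30 seconds, lowering `max_words` is the simplest adjustment.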

Model Card

Kokoro is a frontier text-to-speech (TTS) model for its size: 82 million parameters, text in and audio out.

On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.
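
As a rough back-of-the-envelope check on what full fp32 precision implies for download size, the arithmetic below multiplies the rounded parameter count by 4 bytes per value. It is only an estimate: the 82M figure is rounded, and real checkpoint files carry some extra metadata.

```python
PARAMS = 82_000_000      # rounded parameter count from the model card
BYTES_PER_FP32 = 4       # a 32-bit float occupies 4 bytes


def fp32_weight_bytes(params=PARAMS):
    """Approximate size in bytes of the raw fp32 weights."""
    return params * BYTES_PER_FP32


size_mb = fp32_weight_bytes() / 1e6   # decimal megabytes
print(f"approx. {size_mb:.0f} MB of fp32 weights")
```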

In the weeks leading up to its release, Kokoro v0.19 was the #1 🥇 ranked model in the TTS Spaces Arena. In this single-voice Arena setting, Kokoro achieved a higher Elo than the other models while using fewer parameters and less data:

Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio
XTTS v2: 467M, CPML, >10k hours
Edge TTS: Microsoft, proprietary
MetaVoice: 1.2B, Apache, 100k hours
Parler Mini: 880M, Apache, 45k hours
Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro’s ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
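
For readers unfamiliar with Elo, the sketch below shows the textbook rating update for a single pairwise vote. It is illustrative only: the exact procedure the TTS Spaces Arena uses is not described here, and the K-factor of 32 and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one A-vs-B vote.

    k is the step size; 32 is a common default, assumed here.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Under this model, a rating only reflects how often listeners preferred one model's output over another's, which is why a small model can outrank larger ones.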

Acknowledgements

@yl4579 for architecting StyleTTS 2
@Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena

Model Card Contact

@rzvzn on Discord
Server invite: https://discord.gg/QuGxSWBfQy