Kokoro: A Frontier TTS Model
Note
Kokoro v0.19 can output at most 30 seconds of audio per generation.
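One way to work within this limit is to split long input into sentence-aligned chunks and generate each chunk separately. The sketch below is a minimal illustration, not part of Kokoro: `split_into_chunks` and the `max_chars` budget of 300 are hypothetical, and the number of characters that corresponds to 30 seconds of audio is not stated here, so the budget should be tuned by ear.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    max_chars is an assumed budget, not a documented Kokoro parameter.
    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Still within budget, or this is the first sentence of a new chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the model as its own generation, and the resulting audio segments concatenated in order.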
Model Card
Kokoro is a text-to-speech (TTS) model (text in, audio out) with 82 million parameters, and a frontier model for its size.
On 25 Dec 2024, Kokoro v0.19 weights were released in full fp32 precision under the permissive Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.
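As a rough sanity check on what full fp32 precision means for download size, the weight footprint can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: `weight_size_mb` is a hypothetical helper, 82 million is the rounded figure from this card, and the real checkpoint file may differ slightly because of metadata and rounding.

```python
PARAMS = 82_000_000      # rounded parameter count from the model card
FP32_BYTES_PER_PARAM = 4  # a 32-bit float occupies 4 bytes


def weight_size_mb(n_params: int, bytes_per_param: int = FP32_BYTES_PER_PARAM) -> float:
    """Estimate raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


print(f"Estimated fp32 weights: ~{weight_size_mb(PARAMS):.0f} MB")
```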
In the weeks leading up to its release, Kokoro v0.19 was the #1 🥇 ranked model in the TTS Spaces Arena. In this single-voice Arena setting, Kokoro achieved a higher Elo than other models while using fewer parameters and less data:
- Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio
- XTTS v2: 467M, CPML, >10k hours
- Edge TTS: Microsoft, proprietary
- MetaVoice: 1.2B, Apache, 100k hours
- Parler Mini: 880M, Apache, 45k hours
- Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
Kokoro’s ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
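For readers unfamiliar with Elo, the sketch below shows how a rating changes after one head-to-head comparison. It uses the textbook Elo formulas with a 400-point scale and K = 32; both values are assumptions, and the Arena's actual rating procedure is not described here, so treat this as an illustration of the idea rather than the Arena's implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; model A wins one vote.
a, b = update_elo(1500.0, 1500.0, score_a=1.0)
print(a, b)
```

Because a win against a higher-rated opponent yields a larger gain than a win against a lower-rated one, a small model can climb the ladder only by being preferred consistently over larger competitors.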
Acknowledgements
- @yl4579 for architecting StyleTTS 2
- @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena
Model Card Contact
@rzvzn on Discord
Server invite: https://discord.gg/QuGxSWBfQy