kjjk10 / kokoro-82m

Kokoro is a text-to-speech (TTS) model (text in, audio out) that sits at the frontier for its size of 82 million parameters.

Run time and cost

This model runs on Nvidia T4 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Kokoro: A Frontier TTS Model

Note

Kokoro v0.19 can output a maximum of 30 seconds of audio per generation.
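One way to work within the 30-second limit is to split long input into smaller pieces and generate each piece separately. The following is a minimal sketch, not part of Kokoro itself: the function name `split_for_tts`, the speaking rate of 2.5 words per second, and the 0.8 safety margin are all assumptions chosen for illustration, and actual audio length depends on the voice and the text.

```python
import re

MAX_SECONDS = 30.0               # limit stated in the note above
ASSUMED_WORDS_PER_SECOND = 2.5   # assumption: roughly 150 words per minute
SAFETY_MARGIN = 0.8              # assumption: leave headroom below the limit


def split_for_tts(text, max_seconds=MAX_SECONDS,
                  words_per_second=ASSUMED_WORDS_PER_SECOND,
                  safety=SAFETY_MARGIN):
    """Split text into chunks whose estimated duration fits the limit.

    Hypothetical helper: duration is estimated from word count only.
    """
    # Word budget per chunk, derived from the assumed speaking rate.
    budget = max(1, int(max_seconds * words_per_second * safety))
    # Prefer to break at sentence boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # A single sentence longer than the budget is cut at word boundaries.
        while len(words) > budget:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:budget]))
            words = words[budget:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(words) > budget:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be sent as its own generation request and the resulting audio clips concatenated in order. If a chunk still comes out longer than 30 seconds, lowering `words_per_second` or `safety` shrinks the word budget.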


Model Card

On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.

In the weeks leading up to its release, Kokoro v0.19 was the #1 🥇 ranked model in the TTS Spaces Arena. In this single-voice Arena setting, Kokoro achieved a higher Elo rating than the other models while using fewer parameters and less data:

  • Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio
  • XTTS v2: 467M, CPML, >10k hours
  • Edge TTS: Microsoft, proprietary
  • MetaVoice: 1.2B, Apache, 100k hours
  • Parler Mini: 880M, Apache, 45k hours
  • Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro’s ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
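For readers unfamiliar with Elo ratings, the sketch below shows the standard Elo update after one head-to-head comparison. It is illustrative only: the exact rating procedure used by the TTS Spaces Arena is not described here, and the scale of 400 and K-factor of 32 are conventional defaults assumed for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k is an assumed K-factor, not a value taken from the Arena.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Under this model, a win against a higher-rated opponent moves the ratings more than a win against a lower-rated one, which is how a model can climb a ladder by repeatedly winning listener votes.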


Acknowledgements

  • @yl4579 for architecting StyleTTS 2
  • @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena

Model Card Contact

@rzvzn on Discord
Server invite: https://discord.gg/QuGxSWBfQy