jaaari / zonos

Zonos-v0.1 by Zyphra, voice cloning, 5 languages and emotion control

  • Public
  • 727 runs
  • GitHub
  • Weights
  • Paper
  • License

Run time and cost

This model costs approximately $0.081 to run on Replicate, or 12 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 84 seconds. The predict time for this model varies significantly based on the inputs.
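For programmatic use, here is a minimal sketch using Replicate's Python client. The input field names below (text, language, audio) are assumptions for illustration only; check the model's API schema on its Replicate page for the real ones.

import replicate  # pip install replicate; requires REPLICATE_API_TOKEN to be set

# Hypothetical input names for illustration; consult the model's API schema.
output = replicate.run(
    "jaaari/zonos",
    input={
        "text": "Hello from Zonos!",
        "language": "en-us",
        "audio": "https://example.com/reference-speaker.mp3",  # ~30 s reference clip
    },
)
print(output)  # typically a URL to the generated audio file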

Readme

license: apache-2.0
language:
  - en
  - de
  - fr
  - ja
  - cmn
pipeline_tag: text-to-speech


Disclaimer

Important: I implemented automatic splitting of long texts, but the model is prone to artifacts at the start and end of chunks, so for long texts those artifacts may also show up "in the middle" of the output. The speaking rate is entangled with the duration of the input audio, so try to use around 30 seconds of input audio!
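For illustration only, here is a minimal sketch of the kind of sentence-based splitting described above. The 400-character limit and the exact splitting logic are assumptions, not the implementation used here.

import re

def split_text(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)  # start a new chunk when the limit would be exceeded
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks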

This is an implementation of the TTS model from Zyphra, based on the inference repo govpro-ai/cog-zonos, intended to provide easy inference on Replicate. I am not affiliated with the original Zonos authors, and this is not an official release of the Zonos model. This implementation adds multilingual support as well as emotion input. See the original README below for more details.


Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz. For more details and speech samples, check out our blog. We also have a hosted version available at maia.zyphra.com/audio.

Usage

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the pretrained model (a hybrid variant is also available).
# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Build a speaker embedding from a short reference clip.
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Condition generation on the text, the speaker embedding, and the language.
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate discrete audio codes, then decode them back to a waveform.
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

This should produce a sample.wav file in your project root directory.

For repeated sampling we highly recommend using the Gradio interface instead, as the minimal example needs to load the model every time it is run.

Gradio interface (recommended)

uv run gradio_interface.py
# python gradio_interface.py
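Beyond text and speaker, the conditioning dictionary is where the fine-grained controls live. Continuing from the Usage snippet above, here is a hedged sketch: the keyword names follow the upstream zonos.conditioning.make_cond_dict, but treat the exact emotion vector layout and the value scales as assumptions and verify them against the repository.

cond_dict = make_cond_dict(
    text="I can't believe it worked!",
    speaker=speaker,
    language="en-us",
    emotion=[0.8, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0],  # assumed 8-dim layout, weighted toward happiness
    speaking_rate=15.0,  # assumed scale; higher values speak faster
    pitch_std=45.0,      # larger values give more pitch variation
    fmax=22050.0,        # maximum generated frequency in Hz
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)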

Features

  • Zero-shot TTS with voice cloning: input desired text and a 10-30s speaker sample to generate high-quality TTS output.
  • Audio prefix inputs: add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings.
  • Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German.
  • Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear (see the conditioning sketch above).
  • Fast: the model runs with a real-time factor of ~2x on an RTX 4090.
  • Gradio WebUI: Zonos comes packaged with an easy-to-use Gradio interface for generating speech.
  • Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with the repository (a sketch follows this list).
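As a minimal sketch of the Docker route, assuming the repository's packaged Dockerfile serves the Gradio UI on Gradio's default port 7860 (both assumptions; check the repository):

# Build the image from the repository root.
docker build -t zonos .

# Run with GPU access and expose the Gradio UI (port 7860 assumed).
docker run --gpus all -p 7860:7860 zonos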