---
license: apache-2.0
language:
- en
- de
- fr
- ja
- cmn
pipeline_tag: text-to-speech
---
Disclaimer
Important: I implemented automatic splitting of long texts, but the model is prone to artifacts at the start and end of chunks, so for long texts these artifacts can also show up "in the middle" of the output. The speaking rate is entangled with the input audio duration, so try to use around 30 seconds of input audio!
This is an implementation of the TTS model from Zyphra, based on the inference repo govpro-ai/cog-zonos, to provide easy inference on Replicate. I am not affiliated with the original Zonos authors, and this is not an official release of the Zonos model. This implementation enables multi-language support as well as emotion input. See the original README below for more details.
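For reference, here is a minimal sketch of calling the Replicate deployment from Python. The model identifier and the input field names (text, speaker_audio, language) are assumptions for illustration only; check the model's Replicate page for the actual schema.

import replicate

# Hypothetical model identifier and input field names -- consult the model's
# Replicate page for the real schema before running this.
output = replicate.run(
    "owner/zonos-tts",  # placeholder identifier
    input={
        "text": "Hello from Zonos on Replicate!",
        "speaker_audio": open("reference_30s.wav", "rb"),  # ~30 s reference clip works best
        "language": "en-us",
    },
)
print(output)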
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.
Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz. For more details and speech samples, check out our blog. We also have a hosted version available at maia.zyphra.com/audio.
Usage
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
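# Build a speaker embedding from a short reference clip (10-30 s works well)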
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
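# Sample discrete audio codes from the conditioning, then decode them to a 44 kHz waveform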
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
This should produce a sample.wav file in your project root directory.
For repeated sampling we highly recommend using the Gradio interface instead, as the minimal example needs to load the model every time it is run.
Gradio interface (recommended)
uv run gradio_interface.py
# python gradio_interface.py
Features
Zero-shot TTS with voice cloning: Input the desired text and a 10-30 s speaker sample to generate high-quality TTS output.
Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings.
Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear (see the conditioning sketch after this list).
Fast: Our model runs with a real-time factor of ~2x on an RTX 4090.
Gradio WebUI: Zonos comes packaged with an easy-to-use Gradio interface for generating speech.
Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with our repository.
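The fine-grained conditioning described above is exposed through make_cond_dict. Below is a minimal sketch of emotion and speaking-rate control; the parameter names emotion, speaking_rate, pitch_std, and fmax, as well as the emotion ordering, are assumptions, so check zonos.conditioning.make_cond_dict for the exact signature and value ranges.

import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Emotion is assumed to be an 8-way weight vector ordered as
# [happiness, sadness, disgust, fear, surprise, anger, other, neutral].
emotion = [0.05, 0.05, 0.05, 0.05, 0.05, 0.65, 0.05, 0.05]  # mostly anger

cond_dict = make_cond_dict(
    text="I can't believe you did that!",
    speaker=speaker,
    language="en-us",
    emotion=emotion,       # assumed parameter name
    speaking_rate=13.0,    # assumed parameter name; lower values slow the speech down
    pitch_std=45.0,        # assumed parameter name; higher values increase pitch variation
    fmax=22050.0,          # assumed parameter name; maximum frequency in Hz
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("angry_sample.wav", wavs[0], model.autoencoder.sampling_rate)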