---
license: apache-2.0
language:
- en
- de
- fr
- ja
- cmn
pipeline_tag: text-to-speech
---
Disclaimer
Important: I implemented automatic splitting of long texts, but the model is prone to artifacts at the start and end of chunks, so for long texts these artifacts can also show up "in the middle" of the output. The speaking rate is entangled with the input audio duration, so try to use around 30 seconds of input audio!
This is an implementation of the TTS model from Zyphra, based on the inference repo govpro-ai/cog-zonos, to provide easy inference on Replicate. I am not affiliated with the original Zonos authors, and this is not an official release of the Zonos model. This implementation enables multi-language support as well as emotion input. See the original README below for more details.
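For reference, here is a minimal sketch of calling the Replicate deployment from Python. The model identifier and the input field names (text, speaker_audio, language) are assumptions for illustration only; check the model's Replicate page for the actual schema.

import replicate

# Hypothetical model identifier and input field names -- consult the model's
# Replicate page for the real schema before running this.
output = replicate.run(
    "owner/zonos-tts",  # placeholder identifier
    input={
        "text": "Hello from Zonos on Replicate!",
        "speaker_audio": open("reference_30s.wav", "rb"),  # ~30 s reference clip works best
        "language": "en-us",
    },
)
print(output)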
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.
Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz. For more details and speech samples, check out our blog. We also have a hosted version available at maia.zyphra.com/audio.
Usage
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
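# Build a speaker embedding from a short reference clip (10-30 s works well)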
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
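# Sample discrete audio codes from the conditioning, then decode them to a 44 kHz waveform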
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
This should produce a sample.wav file in your project root directory.
For repeated sampling we highly recommend using the Gradio interface instead, as the minimal example needs to load the model every time it is run.
Gradio interface (recommended)
uv run gradio_interface.py
# python gradio_interface.py
Features
Zero-shot TTS with voice cloning: Input the desired text and a 10-30 s speaker sample to generate high-quality TTS output.
Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings.
Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German
Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio. These include speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear (see the conditioning sketch after this list).
Fast: Our model runs with a real-time factor of ~2x on an RTX 4090.
Gradio WebUI: Zonos comes packaged with an easy-to-use Gradio interface for generating speech.
Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with our repository.
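The fine-grained conditioning described above is exposed through make_cond_dict. Below is a minimal sketch of emotion and speaking-rate control; the parameter names emotion, speaking_rate, pitch_std, and fmax, as well as the emotion ordering, are assumptions, so check zonos.conditioning.make_cond_dict for the exact signature and value ranges.

import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Emotion is assumed to be an 8-way weight vector ordered as
# [happiness, sadness, disgust, fear, surprise, anger, other, neutral].
emotion = [0.05, 0.05, 0.05, 0.05, 0.05, 0.65, 0.05, 0.05]  # mostly anger

cond_dict = make_cond_dict(
    text="I can't believe you did that!",
    speaker=speaker,
    language="en-us",
    emotion=emotion,       # assumed parameter name
    speaking_rate=13.0,    # assumed parameter name; lower values slow the speech down
    pitch_std=45.0,        # assumed parameter name; higher values increase pitch variation
    fmax=22050.0,          # assumed parameter name; maximum frequency in Hz
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("angry_sample.wav", wavs[0], model.autoencoder.sampling_rate)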