afiaka87 / tortoise-tts

Generate speech from text and clone voices from mp3 files. By James Betker, AKA "neonbjb".

  • Public
  • 169.5K runs
  • T4
  • GitHub
  • Paper
  • License

Input

string

Text to speak.

Default: "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them."

string

Selects the voice to use for generation. Use `random` to select a random voice. Use `custom_voice` to use a custom voice.

Default: "random"

file

(Optional) Create a custom voice based on an mp3 file of a speaker. The audio should be at least 15 seconds long, contain only one speaker, and be in mp3 format. Overrides the `voice_a` input.

string

(Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.

Default: "disabled"

string

(Optional) Create a new voice by averaging the latents for `voice_a`, `voice_b` and `voice_c`. Use `disabled` to disable voice mixing.

Default: "disabled"
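
As a rough illustration of what "averaging the latents" could mean, the sketch below takes an element-wise mean over per-voice latent vectors and skips any slot set to `disabled`. It is an assumption-laden toy, not the model's actual code: the function names `average_latents` and `mix_voices` are hypothetical, and latents are represented as plain lists of floats rather than the tensors the real model presumably uses.

```python
from typing import Dict, List, Sequence


def average_latents(latents: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized latent vectors (hypothetical helper)."""
    if not latents:
        raise ValueError("need at least one latent vector")
    length = len(latents[0])
    if any(len(vec) != length for vec in latents):
        raise ValueError("all latent vectors must have the same length")
    # zip(*latents) groups the i-th component of every vector together.
    return [sum(column) / len(latents) for column in zip(*latents)]


def mix_voices(
    voice_latents: Dict[str, Sequence[float]],
    selected: Sequence[str],
) -> List[float]:
    """Average the latents of every selected voice that is not 'disabled'."""
    active = [name for name in selected if name != "disabled"]
    return average_latents([voice_latents[name] for name in active])


if __name__ == "__main__":
    # Toy two-dimensional latents, purely illustrative values.
    toy = {"a": [0.0, 2.0], "b": [4.0, 6.0]}
    print(mix_voices(toy, ["a", "disabled", "b"]))
```

With one voice active the mix is just that voice's latent; each additional active voice pulls the result toward its own latent with equal weight.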

string

Which voice preset to use. See the documentation for more information.

Default: "fast"

integer

Random seed which can be used to reproduce results.

Default: 0

number
(minimum: 0, maximum: 1)

How much the CVVP model should influence the output. Increasing this can, in some cases, reduce the likelihood of multiple speakers. Defaults to 0 (disabled).

Default: 0
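
The inputs above can be gathered into a single payload before a prediction is requested. The following is a minimal sketch that only builds and validates a dictionary; it makes no network call. Only `voice_a`, `voice_b` and `voice_c` are field names that appear on this page. The names `text`, `custom_voice`, `preset`, `seed` and `cvvp_amount`, and the helper `build_input` itself, are assumptions made for illustration and should be checked against the model's actual schema.

```python
from typing import Any, Dict, Optional

DEFAULT_TEXT = (
    "The expressiveness of autoregressive transformers is literally nuts! "
    "I absolutely adore them."
)


def build_input(
    text: str = DEFAULT_TEXT,
    voice_a: str = "random",
    voice_b: str = "disabled",
    voice_c: str = "disabled",
    preset: str = "fast",              # assumed field name
    seed: int = 0,                     # assumed field name
    cvvp_amount: float = 0.0,          # assumed field name
    custom_voice: Optional[str] = None,  # assumed field name; path or URL of an mp3
) -> Dict[str, Any]:
    """Build an input payload using the defaults listed on this page."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # The page gives the allowed range as minimum 0, maximum 1.
    if not 0.0 <= cvvp_amount <= 1.0:
        raise ValueError("cvvp_amount must be between 0 and 1")
    payload: Dict[str, Any] = {
        "text": text,
        "voice_a": voice_a,
        "voice_b": voice_b,
        "voice_c": voice_c,
        "preset": preset,
        "seed": seed,
        "cvvp_amount": cvvp_amount,
    }
    # Only include the optional mp3 when one was supplied.
    if custom_voice is not None:
        payload["custom_voice"] = custom_voice
    return payload


if __name__ == "__main__":
    print(build_input("Hello there.", cvvp_amount=0.5))
```

Validating the range locally means an out-of-range CVVP value fails before any paid prediction is started.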

Output


Run time and cost

This model costs approximately $0.076 to run on Replicate, or 13 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
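
The "13 runs per $1" figure follows from dividing the budget by the approximate per-run cost and rounding down. A small sketch of that arithmetic, with hypothetical helper names and the quoted $0.076 treated as an estimate rather than a fixed price:

```python
import math

APPROX_COST_PER_RUN = 0.076  # USD, the approximate figure quoted on this page


def runs_per_budget(budget: float, cost_per_run: float = APPROX_COST_PER_RUN) -> int:
    """Whole number of runs that fit in a budget at a given per-run cost."""
    if cost_per_run <= 0:
        raise ValueError("cost_per_run must be positive")
    return math.floor(budget / cost_per_run)


def estimated_cost(runs: int, cost_per_run: float = APPROX_COST_PER_RUN) -> float:
    """Approximate total cost in USD for a number of runs, rounded to cents."""
    return round(runs * cost_per_run, 2)


if __name__ == "__main__":
    print(runs_per_budget(1.0))   # 1 / 0.076 is about 13.16, so 13 whole runs
    print(estimated_cost(100))    # 100 * 0.076 = 7.6
```

Because the actual cost varies with the inputs, these numbers are best read as rough planning estimates.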

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 6 minutes. The predict time for this model varies significantly based on the inputs.
