cjwbw / voicecraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

  • Public
  • 10K runs
  • L40S

Input

string

Choose a task

Default: "zero-shot text-to-speech"

string

Choose a model

Default: "giga330M_TTSEnhanced.pth"

*file

Original audio file

string

Optionally provide the transcript of the input audio. Leave it blank to use the WhisperX model below to generate the transcript. An inaccurate transcript may cause errors in TTS or speech editing.

Default: ""

string

If orig_transcript is not provided above, choose a WhisperX model for generating the transcript. An inaccurate transcript may cause errors in TTS or speech editing. You can also correct the generated transcript and supply it directly as orig_transcript above.

Default: "base.en"

*string

Transcript of the target audio file

number

Only used for the zero-shot text-to-speech task. The number of seconds from the start of the original audio to use as the voice reference. About 3 seconds of reference is generally enough for high-quality voice cloning, but longer is generally better; try, for example, 3 to 6 seconds.

Default: 3.01
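
As a rough sketch of what using "the first seconds" of the original audio could look like, the helper below keeps only the leading portion of a sample buffer. The function name `take_prompt`, the parameter name `cut_off_sec`, and the 16 kHz sample rate are assumptions made for illustration; none of them is stated on this page.

```python
def take_prompt(samples, cut_off_sec=3.01, sample_rate=16000):
    """Return the first cut_off_sec seconds of a mono sample sequence.

    sample_rate=16000 is an assumed value, not taken from the model page.
    """
    if cut_off_sec <= 0:
        raise ValueError("cut_off_sec must be positive")
    n = round(cut_off_sec * sample_rate)  # number of samples to keep
    return samples[:n]  # slicing clamps to the buffer length automatically


if __name__ == "__main__":
    audio = [0.0] * (16000 * 5)  # 5 seconds of silence as stand-in audio
    prompt = take_prompt(audio)
    print(len(prompt), "samples kept")
```

If the recording is shorter than the requested cut-off, the slice simply returns the whole buffer.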

integer

Set to 0 to use less VRAM, but with slower inference

Default: 1

number

Margin to the left of the editing segment

Default: 0.08

number

Margin to the right of the editing segment

Default: 0.08
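
One plausible reading of the two margins is that they widen the editing segment slightly on each side before regeneration. The sketch below illustrates that reading only; the function name `widen_span` and the clamping to the audio bounds are assumptions, not behavior confirmed by this page.

```python
def widen_span(start, end, duration, left_margin=0.08, right_margin=0.08):
    """Widen an edit span (in seconds) by the margins, clamped to the audio.

    Assumes the margins extend the span outward; the page does not say so.
    """
    if not 0 <= start <= end <= duration:
        raise ValueError("expected 0 <= start <= end <= duration")
    new_start = max(0.0, start - left_margin)
    new_end = min(duration, end + right_margin)
    return new_start, new_end


if __name__ == "__main__":
    print(widen_span(1.0, 2.0, duration=5.0))
```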

number

Adjusts the randomness of outputs: values greater than 1 increase randomness, and 0 is deterministic. Changing this value is not recommended.

Default: 1
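
The description of the randomness setting matches the usual temperature-scaled sampling scheme. The sketch below assumes that scheme; the function name `sample_with_temperature` is hypothetical and this is not the model's actual decoding code.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Pick an index from logits; a temperature of 0 falls back to argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Deterministic case: always return the highest-scoring index.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [0.1, 2.0, 0.5]
    print(sample_with_temperature(logits, temperature=0))
    print(sample_with_temperature(logits, temperature=1.0, rng=random.Random(0)))
```

Dividing the logits by a value above 1 flattens the distribution, so less likely tokens are chosen more often; values near 0 concentrate probability on the top token.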

number

Default value for TTS is 0.9, and 0.8 for speech editing

Default: 0.9

integer

Default value for TTS is 3, and -1 for speech editing. A value of -1 means the probability of silence tokens is not adjusted. If the output contains long silences or unnaturally stretched words, increase sample_batch_size to 2, 3, or even 4.

Default: 3

integer

Default value for TTS is 4, and 1 for speech editing. Under the hood, the model generates this many samples and chooses the shortest one, so a higher number tends to produce faster speech in the output.

Default: 4
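
The "generate several samples and choose the shortest" behavior described for sample_batch_size can be sketched as follows. The `generate` callable and `fake_generate` are hypothetical stand-ins, and the candidates are produced one after another here for simplicity, whereas a real model would likely produce them as one batch.

```python
import random


def pick_shortest(generate, sample_batch_size=4):
    """Call generate() sample_batch_size times and keep the shortest result."""
    if sample_batch_size < 1:
        raise ValueError("sample_batch_size must be at least 1")
    candidates = [generate() for _ in range(sample_batch_size)]
    return min(candidates, key=len)


def fake_generate(rng):
    """Stand-in generator: returns a token list of random length."""
    return [0] * rng.randint(50, 100)


if __name__ == "__main__":
    rng = random.Random(0)
    best = pick_shortest(lambda: fake_generate(rng), sample_batch_size=4)
    print(len(best), "tokens in the chosen sample")
```

Because a shorter token sequence corresponds to shorter audio for the same text, drawing more candidates makes it less likely that the chosen one contains long pauses or stretched words.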

integer

Random seed. Leave blank to randomize the seed
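
Several of the inputs above list one default for TTS and another for speech editing. The sketch below gathers those pairs in one place. Only `sample_batch_size` is named on this page; the key names `top_p` and `stop_repetition` are assumptions for the 0.9/0.8 and 3/-1 settings, and any task string other than "zero-shot text-to-speech" is treated as speech editing because the other task names are not listed here.

```python
TTS_TASK = "zero-shot text-to-speech"


def recommended_defaults(task=TTS_TASK):
    """Return the per-task defaults quoted in the input descriptions.

    Key names other than sample_batch_size are assumed, not confirmed.
    """
    if task == TTS_TASK:
        return {"top_p": 0.9, "stop_repetition": 3, "sample_batch_size": 4}
    # Any other task is treated as a speech-editing variant.
    return {"top_p": 0.8, "stop_repetition": -1, "sample_batch_size": 1}


if __name__ == "__main__":
    print(recommended_defaults())
    print(recommended_defaults("speech editing"))
```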

Output

generated_audio


whisper_transcript_orig_audio

But when I had approached so near to them, the common object, which the sense deceives, lost not by distance any of its marks.

This output was created using a different version of the model, cjwbw/voicecraft:6e42571a.

Run time and cost

This model costs approximately $0.0063 to run on Replicate, or 158 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
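
The runs-per-dollar figure follows directly from the per-run cost. A small sketch of the arithmetic, using the approximate price quoted above (the function names are illustrative, and actual cost varies with inputs):

```python
import math

COST_PER_RUN_USD = 0.0063  # approximate figure quoted on this page


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Whole number of runs that fit into one dollar."""
    return math.floor(1 / cost_per_run)


def estimated_cost(n_runs, cost_per_run=COST_PER_RUN_USD):
    """Approximate total cost in USD for n_runs predictions."""
    return n_runs * cost_per_run


if __name__ == "__main__":
    print(runs_per_dollar())
    print(round(estimated_cost(1000), 2))
```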

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 7 seconds.

Readme

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild


TL;DR

VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.

Acknowledgement

We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing encodec.

Citation

@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone’s speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.