cjwbw / voicecraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

  • Public
  • 9.2K runs
  • GitHub
  • Paper
  • License

Run time and cost

This model costs approximately $0.0057 to run on Replicate, or 175 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 8 seconds.

Readme

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Demo Paper

TL;DR

VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.

Acknowledgement

We thank Feiteng for his VALL-E reproduction, and we thank audiocraft team for open-sourcing encodec.

Citation

@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone’s speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.