chenxwh / voicecraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

  • Public
  • 5.2K runs
  • GitHub
  • Paper
  • License



Run time and cost

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 17 seconds, though prediction time varies significantly with the inputs.


VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Demo Paper


VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.
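Since this model is hosted on Replicate, one way to try it is through Replicate's HTTP predictions API. The sketch below builds a prediction request with only the standard library; the input field names (`audio`, `target_transcript`) and the model version string are illustrative assumptions, so check the model's input schema on its Replicate page before use.

```python
import json
import urllib.request

# Replicate's predictions endpoint (see Replicate's HTTP API docs).
API_URL = "https://api.replicate.com/v1/predictions"


def build_payload(version: str, audio_url: str, transcript: str) -> dict:
    """Assemble the JSON body for a prediction request.

    NOTE: "audio" and "target_transcript" are guessed field names for
    illustration; the actual schema is listed on the model's page.
    """
    return {
        "version": version,  # the model version hash from the Replicate page
        "input": {
            "audio": audio_url,            # URL of the few-second reference clip
            "target_transcript": transcript,  # text the cloned voice should speak
        },
    }


def run_prediction(payload: dict, token: str) -> dict:
    """POST the payload to Replicate and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",  # your REPLICATE_API_TOKEN
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In practice you would poll the returned prediction's status URL until it completes; Replicate's official Python client wraps this whole flow in a single call.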


We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing encodec.


@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}


Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit anyone's speech without their consent; this includes, but is not limited to, government leaders, political figures, and celebrities. Failure to comply may place you in violation of copyright laws.