VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
TL;DR
VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.
To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.
Acknowledgement
We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing encodec.
Citation
@article{peng2024voicecraft,
author = {Peng, Puyuan and Huang, Po-Yao and Li, Daniel and Mohamed, Abdelrahman and Harwath, David},
title = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
journal = {arXiv},
year = {2024},
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone’s speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.