# Cog Implementation of PolyAI-LDN/pheme

## Pheme Model
This repo contains recipes and models used for training Pheme TTS models. It is the official implementation of the paper Pheme: Efficient and Conversational Speech Generation. A demo is available here, and a selection of audio samples can be found here.
Our Pheme TTS framework validates several hypotheses:
- We can train Transformer-based conversational TTS models with far less training data than models such as VALL-E or SoundStorm (roughly 10x less).
- Training can be performed with conversational, podcast, and noisy data like GigaSpeech.
- Efficiency is paramount: parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
- A fundamental ingredient is the separation of semantic and acoustic tokens, together with an adequate speech tokenizer.
- Inference can be run in parallel through MaskGit-style inference, with 15x speed-ups over similarly sized autoregressive models.
- The single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
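To illustrate the MaskGit-style parallel inference mentioned above, here is a minimal sketch of iterative parallel decoding: all positions start masked, and each step commits the most confident predictions while re-masking the rest on a shrinking schedule. The `predict_fn` interface, the cosine schedule, and the dummy model below are illustrative assumptions, not Pheme's actual implementation.

```python
import math
import random

MASK = -1  # sentinel for a masked acoustic-token position


def maskgit_decode(seq_len, predict_fn, steps=8):
    """MaskGit-style iterative parallel decoding (sketch).

    predict_fn(tokens) returns a (token, confidence) pair for every
    position. Each step fills all masked positions in parallel, then
    re-masks the least confident fills; the number of masked positions
    shrinks to zero following a cosine schedule.
    """
    tokens = [MASK] * seq_len
    for step in range(steps):
        # Fraction of positions that stay masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        n_masked_next = int(mask_ratio * seq_len)

        preds = predict_fn(tokens)

        # Fill every currently masked position with its prediction.
        candidates = []
        for i, tok in enumerate(tokens):
            if tok == MASK:
                pred_tok, conf = preds[i]
                tokens[i] = pred_tok
                candidates.append((conf, i))

        # Re-mask the least confident freshly filled positions.
        candidates.sort()
        for _conf, i in candidates[:n_masked_next]:
            tokens[i] = MASK
    return tokens


def dummy_predict(tokens):
    # Stand-in for the acoustic model: random tokens, random confidences.
    return [(random.randrange(100), random.random()) for _ in tokens]


out = maskgit_decode(seq_len=16, predict_fn=dummy_predict)
```

Because the schedule reaches zero on the final step, the sequence is fully committed after `steps` forward passes instead of `seq_len` autoregressive ones, which is the source of the speed-up.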
```bibtex
@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```