lucataco / pheme

Pheme generates a variety of conversational voices at 16 kHz for phone-call applications

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 17 seconds. The predict time for this model varies significantly based on the inputs.
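
For reference, the hosted model can be called from Python with the Replicate client along the lines below. This is a sketch: the input field name (text) is an assumption, so check the model's Input schema before using it.

```python
import replicate

# Run the hosted model; pin a specific version with
# "lucataco/pheme:<version-id>" if name-only resolution fails.
output = replicate.run(
    "lucataco/pheme",
    input={"text": "Hello! Thanks for calling, how can I help you today?"},
)
print(output)  # typically a URL to the generated 16 kHz audio file
```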

Readme

Cog Implementation of PolyAI-LDN/pheme
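
As a rough illustration of how a Cog wrapper is structured, here is a skeletal predict.py. The BasePredictor/setup/predict shape follows Cog's Python API, but the input schema is assumed and the synthesis step is replaced with a placeholder tone rather than the real Pheme call.

```python
from cog import BasePredictor, Input, Path
import numpy as np
import soundfile as sf

class Predictor(BasePredictor):
    def setup(self):
        # The real predictor would load the Pheme checkpoints here once,
        # so every subsequent prediction reuses the warm model.
        pass

    def predict(
        self,
        text: str = Input(description="Text to synthesize"),
    ) -> Path:
        # Placeholder output: one second of a 220 Hz tone at 16 kHz,
        # standing in for Pheme's actual synthesis.
        sr = 16_000
        t = np.linspace(0.0, 1.0, sr, endpoint=False)
        wav = 0.1 * np.sin(2 * np.pi * 220.0 * t)
        sf.write("/tmp/output.wav", wav.astype(np.float32), sr)
        return Path("/tmp/output.wav")
```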

Pheme Model

This repo contains recipes and models used for training Pheme TTS models. It is the official implementation of the paper Pheme: Efficient and Conversational Speech Generation. A demo is available here, and a selection of audio samples can be found here.

Our Pheme TTS framework validates several hypotheses:

  1. We can train Transformer-based conversational TTS models with far less training data than, e.g., VALL-E or SoundStorm (roughly 10x less data).
  2. Training can be performed with conversational, podcast, and noisy data such as GigaSpeech.
  3. Efficiency is paramount, spanning parameter efficiency (compact models), data efficiency (less training data), and inference efficiency (reduced latency).
  4. One fundamental ingredient is the separation of semantic and acoustic tokens, together with an adequate speech tokenizer.
  5. Inference can be run in parallel via MaskGit-style decoding, with 15x speed-ups compared to similarly sized autoregressive models (see the sketch after this list).
  6. Single-speaker quality can be improved through student-teacher training with (synthetic) data generated by third-party providers.
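
Hypothesis 5 refers to MaskGit-style iterative parallel decoding: start from a fully masked token sequence, predict every position at once, commit the most confident predictions, and re-mask the rest for the next round. Below is a minimal, self-contained PyTorch sketch of that loop; the cosine schedule, step count, and the random stand-in model are illustrative assumptions, not Pheme's actual code.

```python
import math
import torch

def maskgit_decode(model, seq_len, mask_id, steps=8):
    """Iterative parallel decoding: commit the most confident tokens each step."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                             # (1, seq_len, vocab)
        confidence, candidates = logits.softmax(-1).max(-1)
        # Tentatively fill every still-masked position with its best guess.
        filled = torch.where(tokens == mask_id, candidates, tokens)
        # Positions committed in earlier steps are never re-masked.
        confidence = torch.where(
            tokens == mask_id,
            confidence,
            torch.full_like(confidence, float("inf")),
        )
        # Cosine schedule: fraction of positions left masked after this step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_mask > 0:
            # Re-mask the n_mask least confident positions for the next round.
            remask = confidence.argsort(dim=-1)[:, :n_mask]
            filled.scatter_(1, remask, mask_id)
        tokens = filled
    return tokens

# Toy demo with a random stand-in "model" over a 1024-token codebook.
dummy_model = lambda t: torch.randn(t.shape[0], t.shape[1], 1024)
codes = maskgit_decode(dummy_model, seq_len=50, mask_id=1024)
```

The key point is that the number of forward passes equals the step count (8 here) rather than the sequence length, which is where the speed-up over token-by-token autoregressive decoding comes from.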

Citation

@misc{budzianowski2024pheme,
      title={Pheme: Efficient and Conversational Speech Generation},
      author={Paweł Budzianowski and Taras Sereda and Tomasz Cichy and Ivan Vulić},
      year={2024},
      eprint={2401.02839},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}