adirik / styletts2

Generates speech from text


Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 4 minutes, though run time varies significantly with the inputs.

Readme

StyleTTS 2

StyleTTS 2 is a text-to-speech model that generates speech from text, optionally conditioned on a reference speech clip whose style it copies (speaker adaptation). See the original repository and paper for details.

API Usage

To use the model, provide the text you would like to convert to speech. Optionally, provide a reference speech clip (.mp3 or .wav, 2-8 seconds long) for speaker adaptation. The API returns an .mp3 file with the generated speech; a minimal usage sketch is shown after the parameter list.

Input parameters are as follows:
- text: Text to convert to speech.
- reference: (Optional) Reference speech clip to copy the style from.
- alpha: Only used for long text inputs or when a reference speaker is given; controls the timbre of the speaker. Lower values sample the style from the previous sentence or the reference speech rather than from the text.
- beta: Only used for long text inputs or when a reference speaker is given; controls the prosody of the speaker. Lower values sample the style from the previous sentence or the reference speech rather than from the text.
- diffusion_steps: Number of diffusion steps.
- embedding_scale: Embedding scale; use higher values for more pronounced emotion.
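
Below is a minimal sketch of calling the model through the Replicate Python client. The parameter values are illustrative rather than recommended defaults, the version string is left as a placeholder, and the reference-audio line is commented out since it is optional; check the model page for the current version and defaults.

import replicate

# Minimal sketch using the Replicate Python client (pip install replicate).
# Values below are illustrative examples, not tuned defaults.
output = replicate.run(
    "adirik/styletts2",  # optionally pin a version: "adirik/styletts2:<version-hash>"
    input={
        "text": "StyleTTS 2 generates natural-sounding speech from text.",
        # "reference": open("speaker.wav", "rb"),  # optional 2-8 second clip for speaker adaptation
        "alpha": 0.3,
        "beta": 0.7,
        "diffusion_steps": 10,
        "embedding_scale": 1.0,
    },
)
print(output)  # URL of the generated .mp3 file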

References

@article{Li2023StyleTTS2T,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.07691},
  url={https://api.semanticscholar.org/CorpusID:259145293}
}