adirik / hierspeechpp

Zero-shot speech synthesizer for text-to-speech and voice conversion

Run time and cost

This model costs approximately $0.030 to run on Replicate, or 33 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 54 seconds. The predict time for this model varies significantly based on the inputs.

Readme

HierSpeech++

HierSpeech++ is a zero-shot speech synthesis model that generates speech from input text in the voice of a given target speaker, and can also perform voice conversion from a reference recording. See the original repository and paper for details.

API Usage

To use the model, provide the text you would like to convert to speech and a sound file of your target voice as input. Optionally, provide a reference speech recording (.mp3 or .wav) instead of text; its speech content will be used for the output. The API returns an .mp3 file with the generated speech.

Input parameters are as follows (see the example call after the list):
- input_text: (optional) text input to the model. If provided, it is used as the speech content of the output. Provide either input_text or input_sound.
- input_sound: (optional) audio input to the model. If provided, its speech content is used for the output.
- target_voice: a voice clip of the target speaker to synthesize.
- denoise_ratio: noise control. 0 applies no noise reduction, 1 applies maximum noise reduction. If noise reduction is desired, a value of 0.6-0.8 is recommended.
- text_to_vector_temperature: temperature of the text-to-vector model. Larger values produce slightly more varied output.
- output_sample_rate: sample rate of the output audio file.
- scale_output_volume: scale normalization. If set to true, the output volume is scaled to match the input sound, if one is provided.
- seed: random seed for reproducibility.
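
A minimal sketch of calling the model with the Replicate Python client is shown below. The model version hash and the parameter values are placeholders for illustration; substitute the current version and your own inputs from the model page.

# pip install replicate; requires REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    # "<version>" is a placeholder; use the latest version hash shown on the model page.
    "adirik/hierspeechpp:<version>",
    input={
        "input_text": "Hello, this is a zero-shot speech synthesis test.",  # or use "input_sound" instead
        "target_voice": open("target_voice.wav", "rb"),  # voice clip of the target speaker
        "denoise_ratio": 0.7,          # recommended 0.6-0.8 when denoising is desired
        "output_sample_rate": 24000,   # illustrative value
        "scale_output_volume": False,
        "seed": 42,
    },
)
print(output)  # URL of the generated .mp3 file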

References

@article{Lee2023HierSpeechBT,
  title={HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis},
  author={Sang-Hoon Lee and Haram Choi and Seung-Bin Kim and Seong-Whan Lee},
  journal={ArXiv},
  year={2023},
  volume={abs/2311.12454},
  url={https://api.semanticscholar.org/CorpusID:265308903}
}