adirik / styletts2

Generates speech from text

If you haven’t yet trained a model on Replicate, we recommend reading Replicate’s fine-tuning guides first.

Pricing

Trainings for this model run on 8x Nvidia A40 (Large) GPU hardware, which costs $0.0058 per second (roughly $20.88 per hour).

Create a training

Install the Python library:

pip install replicate

Then, run this to create a training with adirik/styletts2:989cb5ea as the base model:

import replicate

training = replicate.trainings.create(
  version="adirik/styletts2:989cb5ea6d2401314eb30685740cb9f6fd1c9001b8940659b406f952837ab5ac",
  input={
    ...
  },
  destination=f"{username}/<destination-model-name>"
)

print(training)

Alternatively, create the training with cURL:

curl -s -X POST \
  -d '{"destination": "{username}/<destination-model-name>", "input": {...}}' \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  https://api.replicate.com/v1/models/adirik/styletts2/versions/989cb5ea6d2401314eb30685740cb9f6fd1c9001b8940659b406f952837ab5ac/trainings

The API response will look like this:

{
  "id": "zz4ibbonubfz7carwiefibzgga",
  "version": "989cb5ea6d2401314eb30685740cb9f6fd1c9001b8940659b406f952837ab5ac",
  "status": "starting",
  "input": {
    "data": "..."
  },
  "output": null,
  "error": null,
  "logs": null,
  "started_at": null,
  "created_at": "2023-03-28T21:47:58.566434Z",
  "completed_at": null
}
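
While the training runs, you can poll its status from the Python client until it reaches a terminal state. A minimal sketch, assuming the training id returned above; the polling interval is arbitrary:

import time
import replicate

# Look up the training by its id and wait for it to finish.
training = replicate.trainings.get("zz4ibbonubfz7carwiefibzgga")
while training.status not in ("succeeded", "failed", "canceled"):
    time.sleep(30)  # arbitrary polling interval
    training = replicate.trainings.get(training.id)

print(training.status)
print(training.output)  # on success, points to the fine-tuned weights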

Note that before you can create a training, you’ll need to create a model and use its name as the value for the destination field.
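
As a sketch, the destination model can also be created ahead of time with the Python client; the model name, visibility, and hardware values below are illustrative, so adjust them for your account:

import replicate

# Create an empty model to serve as the destination for the fine-tuned weights.
model = replicate.models.create(
    owner="your-username",        # your Replicate username or organization
    name="styletts2-finetuned",   # hypothetical destination model name
    visibility="private",         # or "public"
    hardware="gpu-a40-large",     # hardware the resulting model will run on
)
print(f"{model.owner}/{model.name}")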

Fine-tuning with your own data

You can use the train endpoint to fine-tune the model on new speakers, then run inference with the fine-tuned model by providing the URL to its weights.

Input parameters are as follows:
- dataset: URL to a .zip file containing the dataset. It must contain a wavs folder with WAV files at a 24kHz sample rate, a train_data.txt file containing training data, and a validation_data.txt file containing validation data. If SLM adversarial training is desired, it must also contain an OOD_data.txt file with out-of-distribution texts for SLM adversarial training.

The dataset must be a zip file with the following structure (a packaging sketch appears after the parameter list below):

├── wavs
│   ├── 1.wav
│   ├── 2.wav
│   ├── 3.wav
├── train_data.txt
├── validation_data.txt
├── OOD_data.txt

train_data.txt and validation_data.txt should have one wav file name|transcription|speaker id entry per line. A sample train_data.txt file would look like this:

1.wav|ðɪs ɪz ðə fɜːst ˈsɑːmpᵊl.|0
2.wav|ðɪs ɪz ðə ˈsɛkənd ˈsɑːmpᵊl.|0
3.wav|ðɪs ɪz ðə θɜːd ˈsɑːmpᵊl.|1

OOD_data.txt should have one transcription|speaker id (or wav file name|transcription|speaker id) entry per line. A sample OOD_data.txt file would look like this:

fɜːst ˈsɑːmpᵊl.|0
ˈsɛkənd ˈsɑːmpᵊl.|0
θɜːd ˈsɑːmpᵊl.|1
- num_train_epochs: Number of epochs to train.
- style_diff_starting_epoch: Epoch at which to start style diffusion.
- joint_training_starting_epoch: Epoch at which to start SLM adversarial training. If set to a value larger than num_train_epochs, SLM adversarial training is not performed.
- batch_size: Batch size.
- min_length_ood: Minimum length of OOD texts used for training. This ensures that the synthesized speech has a minimum length.
- max_len_audio: Maximum audio length during training (in frames). With the default hop size of 300, one frame corresponds to roughly 300 / 24000 ≈ 0.0125 seconds. If you run into out-of-memory errors, try a lower value.
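
Putting the pieces together, here is a hedged sketch that packages a local dataset directory into the expected zip layout with the standard library and then starts a training. The directory name, dataset URL, and parameter values are illustrative, and hosting the zip at a publicly reachable URL is left to you:

import zipfile
from pathlib import Path

import replicate

# Package a local dataset directory into the expected layout:
#   wavs/*.wav, train_data.txt, validation_data.txt, OOD_data.txt
dataset_dir = Path("my_dataset")  # hypothetical local directory
with zipfile.ZipFile("dataset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for wav in sorted((dataset_dir / "wavs").glob("*.wav")):
        zf.write(wav, arcname=f"wavs/{wav.name}")
    for name in ("train_data.txt", "validation_data.txt", "OOD_data.txt"):
        zf.write(dataset_dir / name, arcname=name)

# After uploading dataset.zip to a publicly accessible URL, create the training.
training = replicate.trainings.create(
    version="adirik/styletts2:989cb5ea6d2401314eb30685740cb9f6fd1c9001b8940659b406f952837ab5ac",
    input={
        "dataset": "https://example.com/dataset.zip",  # illustrative URL
        "num_train_epochs": 50,                        # values below are illustrative
        "style_diff_starting_epoch": 10,
        "joint_training_starting_epoch": 30,
        "batch_size": 2,
        "max_len_audio": 400,
    },
    destination="your-username/styletts2-finetuned",
)
print(training.id)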