geopti/chatterbox-multilingual

Public
564 runs

Run geopti/chatterbox-multilingual with an API

Use one of our client libraries to get started quickly. Selecting a library opens the Playground tab, where you can adjust the inputs, see the results, and copy the corresponding code into your own project.
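
As a rough illustration of what a client-library call could look like, here is a minimal sketch that assumes the third-party `replicate` Python package is installed and that `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_payload` and `synthesize` are hypothetical, and depending on the client version the model reference may need a `:<version id>` suffix that this page does not list.

```python
import os

# Model name as shown on this page. Some client versions may expect
# "owner/name:<version id>" instead (assumption; check your client's docs).
MODEL = "geopti/chatterbox-multilingual"


def build_payload(text, language="en", **overrides):
    """Return the input dict; fields left out fall back to the model's defaults."""
    payload = {"text": text, "language": language}
    payload.update(overrides)
    return payload


def synthesize(text, language="en", **overrides):
    """Run the model once and return the client's output (a URI per the output schema)."""
    import replicate  # third-party; imported lazily so this file loads without it
    return replicate.run(MODEL, input=build_payload(text, language, **overrides))


# Only attempt a real call when a token is available.
if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    print(synthesize("Bonjour tout le monde.", language="fr", exaggeration=0.7))
```

Only `text` and `language` are sent unless you pass overrides, so every other field keeps its documented default.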

Input schema

These are the fields you can set when running this model through the API. If you omit a field, its default value is used.

| Field | Type | Default value | Range | Description |
|---|---|---|---|---|
| text | string | | | The text you want spoken. Can be a single sentence or a long paragraph; long inputs are automatically split into chunks. |
| language | None | en | | Language of the text. Use the two-letter code (en=English, fr=French, de=German, es=Spanish, ja=Japanese, zh=Chinese, ar=Arabic, el=Greek, etc.). |
| audio_prompt | string | | | Optional reference voice clip (.wav/.mp3). The output will mimic this voice. If left empty, a default voice is used. |
| cfg_weight | number | 0.5 | Max: 1 | How closely the speech follows the text. Higher = sticks to the text more strictly. Lower = more freedom (but can hallucinate or get stuck). |
| exaggeration | number | 0.5 | Max: 1 | How expressive the voice is. Higher = more emotional / dramatic. Lower = more flat / neutral. |
| temperature | number | 0.8 | Max: 2 | Randomness of the voice. Higher = more variation between runs. Lower = more consistent / robotic. |
| repetition_penalty | number | 2 | Min: 1, Max: 5 | Penalty for repeating the same sounds. Higher = less repetition. |
| top_p | number | 1 | Max: 1 | Top-p (nucleus) sampling. Restricts the model to the most likely tokens. 1.0 = no restriction. |
| pause_between_sentences | number | 0.1 | Max: 5 | Length of the silence (in seconds) inserted between sentences. |
| max_words_per_chunk | integer | 60 | Min: 10, Max: 200 | Long texts are split into chunks before generation. This is the max number of words per chunk. Smaller = safer for tricky languages, but slower. |
| repeated_token_threshold | integer | 3 | Min: 2, Max: 10 | If the model repeats the same sound this many times in a row, the chunk is cut off (prevents the model from getting stuck looping). Raise this if too much real speech is being cut. |
| garbage_trim_buffer | integer | 25 | Max: 200 | Number of audio frames kept after the model finishes saying the sentence (each frame = ~40ms). Lower = trims garbage tails more aggressively but may cut off the last syllable. |
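
To make the defaults and limits above concrete, the following sketch checks an input against the listed bounds before it is sent. It is a client-side convenience, not the service's own validation. The names `FIELDS`, `build_input`, and `trim_buffer_seconds` are hypothetical, `text` is treated as required only because no default is listed for it, and lower bounds are enforced only where the page states one.

```python
# Defaults and bounds copied from the input fields above; None means "not listed".
FIELDS = {
    # name: (default, minimum, maximum)
    "cfg_weight": (0.5, None, 1),
    "exaggeration": (0.5, None, 1),
    "temperature": (0.8, None, 2),
    "repetition_penalty": (2, 1, 5),
    "top_p": (1, None, 1),
    "pause_between_sentences": (0.1, None, 5),
    "max_words_per_chunk": (60, 10, 200),
    "repeated_token_threshold": (3, 2, 10),
    "garbage_trim_buffer": (25, None, 200),
}
INTEGER_FIELDS = {"max_words_per_chunk", "repeated_token_threshold", "garbage_trim_buffer"}


def build_input(text, language="en", audio_prompt=None, **overrides):
    """Fill in documented defaults and reject values outside the listed bounds."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")  # assumed to be required
    unknown = set(overrides) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload = {"text": text, "language": language}
    if audio_prompt is not None:  # optional reference voice clip
        payload["audio_prompt"] = audio_prompt

    for name, (default, lo, hi) in FIELDS.items():
        value = overrides.get(name, default)
        if name in INTEGER_FIELDS and not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if lo is not None and value < lo:
            raise ValueError(f"{name}={value} is below the minimum {lo}")
        if hi is not None and value > hi:
            raise ValueError(f"{name}={value} is above the maximum {hi}")
        payload[name] = value
    return payload


def trim_buffer_seconds(frames, frame_ms=40.0):
    """Approximate duration of garbage_trim_buffer, using the ~40 ms frame size."""
    return frames * frame_ms / 1000.0
```

With the default `garbage_trim_buffer` of 25 frames at roughly 40 ms each, about one second of audio is kept after the sentence ends, which `trim_buffer_seconds(25)` reproduces.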

Output schema

The shape of the response you’ll get when you run this model with an API.

Schema
{
  "type": "string",
  "title": "Output",
  "format": "uri"
}
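
Since the schema describes the output as a single string in URI format, a caller will typically want to check the value and save the file it points to. The sketch below does this with the Python standard library only. The helper names are hypothetical, and the fallback filename `output.wav` is an assumption because this page does not state the audio format.

```python
import posixpath
import shutil
import urllib.request
from typing import Optional
from urllib.parse import urlparse


def is_http_uri(value) -> bool:
    """True if value is a string that parses as an http(s) URI with a host."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filename_from_uri(uri: str, fallback: str = "output.wav") -> str:
    """Use the last path segment as a filename; the fallback name is an assumption."""
    name = posixpath.basename(urlparse(uri).path)
    return name or fallback


def download(uri: str, dest: Optional[str] = None) -> str:
    """Stream the file behind the output URI to disk and return the local path."""
    if not is_http_uri(uri):
        raise ValueError(f"not an http(s) URI: {uri!r}")
    dest = dest or filename_from_uri(uri)
    with urllib.request.urlopen(uri) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return dest
```

Some client libraries may wrap the output in a file-like object rather than returning a bare string; in that case convert it to its URL before passing it to `download`.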