collectiveai-team /whisper-wordtimestamps:fa4deeda
Input schema
The fields you can use to run this model with an API. If you don’t give a value for a field its default value will be used.
Field | Type | Default value | Description |
---|---|---|---|
audio | string | | Audio file |
audio_url | string | | Audio URL |
model | string (enum) | base | Choose a Whisper model. Options: tiny, base, small, medium, large-v1, large-v2 |
language | string (enum) | | language spoken in the audio, specify None to perform language detection. Options: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh |
temperature | number | 0 | temperature to use for sampling |
patience | number | | optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search |
suppress_tokens | string | -1 | comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations |
initial_prompt | string | | optional text to provide as a prompt for the first window. |
condition_on_previous_text | boolean | True | if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop |
temperature_increment_on_fallback | number | 0.2 | temperature to increase when falling back when the decoding fails to meet either of the thresholds below |
compression_ratio_threshold | number | 2.4 | if the gzip compression ratio is higher than this value, treat the decoding as failed |
logprob_threshold | number | -1 | if the average log probability is lower than this value, treat the decoding as failed |
no_speech_threshold | number | 0.6 | if the probability of the <\|nospeech\|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence |
word_timestamps | boolean | False | Extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment. |
prepend_punctuations | string | `"'“¿([{-` | If word_timestamps is True, merge these punctuation symbols with the next word |
append_punctuations | string | `"'.。,,!!??::”)]}、` | If word_timestamps is True, merge these punctuation symbols with the previous word |
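As a rough illustration of how the fields above might be assembled into a request, here is a minimal sketch in Python. The helper names `build_input` and `run_model` are hypothetical, and two behaviours are assumptions rather than documented facts: that at least one of `audio` or `audio_url` must be supplied, and that leaving `language` out of the payload is how language detection is requested. The `replicate.run` call belongs to the third-party `replicate` client and needs a full version identifier; this page shows only the prefix `fa4deeda`.

```python
# Enum values copied from the "model" row of the input schema.
ALLOWED_MODELS = ("tiny", "base", "small", "medium", "large-v1", "large-v2")


def build_input(audio=None, audio_url=None, model="base", language=None,
                word_timestamps=False, **options):
    """Assemble an input payload (hypothetical helper, not part of any API).

    Assumption: at least one of `audio` / `audio_url` has to be given.
    """
    if audio is None and audio_url is None:
        raise ValueError("provide audio or audio_url")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")

    payload = {"model": model, "word_timestamps": word_timestamps}
    if audio is not None:
        payload["audio"] = audio
    if audio_url is not None:
        payload["audio_url"] = audio_url
    # Assumption: omitting `language` lets the model detect it.
    if language is not None:
        payload["language"] = language
    # Any other schema field (temperature, patience, ...) is passed through.
    payload.update(options)
    return payload


def run_model(model_ref, payload):
    """Send the payload with the `replicate` client (needs REPLICATE_API_TOKEN).

    `model_ref` has the form "owner/name:<full version id>".
    """
    import replicate  # third-party package, imported lazily
    return replicate.run(model_ref, input=payload)


payload = build_input(audio_url="https://example.com/speech.wav",
                      word_timestamps=True, temperature=0)
```

Keeping validation separate from the network call makes it possible to check a payload locally before spending a prediction on it.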
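The interaction between `temperature`, `temperature_increment_on_fallback`, and the three thresholds can be easier to follow in code. The sketch below only mirrors the rules as the descriptions above state them; it assumes the fallback temperatures run from the starting value up to 1.0, as in the reference Whisper implementation, and this deployment may differ. All function names are hypothetical.

```python
def temperature_schedule(temperature=0.0, increment=0.2):
    """Temperatures to try in order (assumed upper bound: 1.0)."""
    if increment is None or increment <= 0:
        return [temperature]
    temps = []
    step = 0
    while temperature + step * increment <= 1.0 + 1e-6:
        # Round to hide floating-point noise such as 0.6000000000000001.
        temps.append(round(temperature + step * increment, 6))
        step += 1
    return temps


def needs_fallback(compression_ratio, avg_logprob,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """True if a decoding attempt fails either threshold."""
    too_repetitive = (compression_ratio_threshold is not None
                      and compression_ratio > compression_ratio_threshold)
    too_unlikely = (logprob_threshold is not None
                    and avg_logprob < logprob_threshold)
    return too_repetitive or too_unlikely


def is_silence(no_speech_prob, avg_logprob,
               no_speech_threshold=0.6, logprob_threshold=-1.0):
    """True if the segment should be treated as silence.

    Requires BOTH a high no-speech probability and a failure of the
    log-probability threshold, as the no_speech_threshold row describes.
    """
    if no_speech_threshold is None or no_speech_prob <= no_speech_threshold:
        return False
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        return False
    return True


schedule = temperature_schedule()  # uses the defaults from the table
```

With the defaults, a segment whose text compresses unusually well (a sign of repetition) or whose average log probability is below -1 would be retried at the next temperature in the schedule.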
Output schema
The shape of the response you’ll get when you run this model with an API.
{'properties': {'detected_language': {'title': 'Detected Language',
                                      'type': 'string'},
                'segments': {'title': 'Segments'},
                'transcription': {'title': 'Transcription', 'type': 'string'}},
 'required': ['detected_language', 'transcription'],
 'title': 'ModelOutput',
 'type': 'object'}
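The schema leaves the shape of `segments` unspecified. The sketch below shows one way to read word timings out of a response, assuming that `segments` follows the layout used by the reference Whisper implementation when `word_timestamps` is True: a list of segments, each with a `words` list of `word`, `start`, and `end` entries. That layout, the helper name `iter_words`, and the sample data are all assumptions for illustration only.

```python
def iter_words(output):
    """Yield (word, start_seconds, end_seconds) from a model response.

    `segments` is not a required field, so a missing or empty value
    simply yields nothing.
    """
    for segment in output.get("segments") or []:
        for word in segment.get("words", []):
            yield word["word"].strip(), float(word["start"]), float(word["end"])


# Hypothetical response, hand-written to match the assumed layout.
sample_output = {
    "detected_language": "en",
    "transcription": "Hello world.",
    "segments": [
        {
            "text": " Hello world.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.42},
                {"word": " world.", "start": 0.42, "end": 0.9},
            ],
        }
    ],
}

words = list(iter_words(sample_output))
```

Only `detected_language` and `transcription` are listed as required, so code that consumes `segments` should tolerate its absence.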