# Whisper w/ Lazy Loading
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.
This version lets users choose between different model sizes, loading the selected model on demand, which offers flexibility for various use cases.
## Model Versions
| Model Size | Description |
|---|---|
| tiny | Fastest, lowest accuracy |
| base | Fast, lower accuracy |
| small | Balanced speed and accuracy |
| medium | Slower, higher accuracy |
| large-v3 | Slowest, highest accuracy |
For the specific version using only the large-v3 model, check out our single-model version.
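Lazy loading here means the chosen model is only loaded into memory the first time it is requested, then cached for reuse, so a deployment that serves multiple sizes does not pay the load cost on every request. The following is a minimal sketch of that pattern, not this repository's actual implementation; the `load_model` call is stubbed out (a real deployment would call something like `whisper.load_model(size)`):

```python
from functools import lru_cache

MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


@lru_cache(maxsize=None)
def get_model(size: str):
    """Load the requested Whisper model size once and cache it."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size}")
    # Placeholder standing in for e.g. whisper.load_model(size),
    # so the sketch stays self-contained.
    return f"<loaded {size} model>"
```

Because of the cache, repeated calls with the same size return the same model object; only the first call incurs the load.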
## Model Description

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
## License
The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.
## Citation

```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```