turian / whisply

Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

Run time and cost

This model runs on NVIDIA T4 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

whisply-replicate

A Replicate.com service for audio transcription, translation, and speaker diarization using Whisper v3 models.

Features

  • Multiple Whisper Models: Support for various Whisper models, including:
      • large-v3 (default)
      • distil-large-v3
      • large-v3-turbo
      • and many more standard Whisper models

  • Advanced Audio Processing:
      • Automatic audio format conversion and normalization
      • Support for various input audio/video formats
      • GPU-accelerated processing

  • Rich Output Options:
      • Basic transcription (txt, json)
      • Subtitle generation (srt, vtt)
      • Speaker diarization with word-level timestamps
      • Translation to English

Usage

The service accepts the following parameters:

Required:

  • audio_file: Input audio/video file to process

Optional:

  • language: Language code (e.g. 'en', 'fr', 'de') [default: auto-detect]
  • model: Whisper model to use [default: 'large-v3']
  • subtitle: Generate subtitles (.srt, .vtt) [default: false]
  • sub_length: Words per subtitle segment [default: 5] (see the sample cue after this list)
  • translate: Translate to English [default: false]
  • annotate: Enable speaker diarization [default: false]
  • num_speakers: Number of speakers to detect [default: auto-detect]
  • hf_token: HuggingFace token, required for speaker annotation
  • verbose: Print progress during transcription [default: false]
  • post_correction: YAML file with text corrections to apply after transcription
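
For illustration, with the default sub_length=5 each subtitle cue holds roughly five words. A generated .srt entry follows the standard SubRip layout and looks like this (the timestamps and text here are invented for the example):

1
00:00:00,000 --> 00:00:02,400
This is an example cue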

Example Usage with Cog:

# Basic transcription
cog predict -i audio_file=@path/to/audio.mp3

# Full features with speaker diarization
cog predict -i audio_file=@path/to/audio.mp3 \
           -i language=en \
           -i model=large-v3 \
           -i subtitle=true \
           -i translate=true \
           -i annotate=true \
           -i hf_token=your_token_here \
           -i num_speakers=2
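
Example Usage with the Replicate API:

The same inputs can be sent through Replicate's standard predictions endpoint. This is a minimal sketch, not the canonical invocation: the version hash must be copied from the model page (MODEL_VERSION_HASH is a placeholder), and the audio file is assumed to be hosted at a publicly reachable URL.

# create a prediction over HTTP (MODEL_VERSION_HASH is a placeholder)
curl -s -X POST https://api.replicate.com/v1/predictions \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "version": "MODEL_VERSION_HASH",
        "input": {
          "audio_file": "https://example.com/audio.mp3",
          "model": "large-v3",
          "subtitle": true
        }
      }'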

Output

The service returns a zip file containing:

  • Transcription in the requested formats (txt, json)
  • Subtitle files, if requested (srt, vtt)
  • Speaker annotations, if enabled (RTTM format)
  • Translated text, if translation was enabled
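
Once the archive is downloaded (named output.zip here for illustration), the files can be unpacked with standard tools:

# unpack the returned archive and list the generated files
unzip output.zip -d results
ls results    # e.g. .txt, .json, .srt, .vtt, .rttm files, depending on the options chosen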

Technical Details

  • Uses FFmpeg for audio preprocessing
  • Automatic GPU detection and utilization
  • Persistent model caching for faster startup
  • Error handling and validation for all inputs
  • Support for various audio formats through python-magic detection
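
As a rough sketch of the preprocessing step: Whisper models expect 16 kHz mono PCM audio, so the FFmpeg conversion likely resembles the command below (the exact flags the service uses are not documented here).

# normalize arbitrary audio/video input to 16 kHz mono WAV for Whisper
ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le normalized.wav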

Model Caching

  • Whisper v3 models are pre-downloaded during the container build
  • Speaker diarization models (when using annotate=true):
      • require a valid HuggingFace token
      • are cached after first use
      • use persistent storage for subsequent runs
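
When running the container yourself, one way to keep the diarization model weights across runs is to point the HuggingFace cache at persistent storage before invoking a prediction. This is a hedged sketch; the cache path the service actually uses is not documented.

# hypothetical setup: store diarization model weights on a persistent volume
export HF_HOME=/path/to/persistent/cache
cog predict -i audio_file=@path/to/audio.mp3 \
            -i annotate=true \
            -i hf_token=$HF_TOKEN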