danielrosehill/whisper-hebrish

Fine-tune of Whisper Large (Turbo V3) specialised in accurately transcribing "code switching": English speech mixed with Hebrew words


Run time and cost

This model runs on Nvidia T4 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Whisper Hebrish: Whisper Large (Turbo V3) Fine-Tuned For English-Hebrew Immigrant Speech Patterns (“Hebrish”)

ASR For Mixed Speech Patterns (“Code Switching”)

Many immigrant groups at various stages of absorption into non-English-speaking societies adopt a unique linguistic hybrid: their native tongue peppered with liberal dashes of their second language.

I was born in Ireland and immigrated to Israel 10 years ago. More recently, I have become a passionate user of AI tools, especially speech-to-text (STT) and automatic speech recognition (ASR).

Over the course of a year spent transcribing everything from grocery lists to blog outlines (mostly using Whisper or variants of it), I have noticed an obvious pattern: while Whisper is a superlatively good speech recognition model, most of the Hebrew words that English-speaking immigrants use in daily speech ("teudat zehut" - ID card; "mazgan" - air conditioner) are neither English nor sufficiently well known (contrast: Shabbat, Torah) to appear in the corpora ingested into ASR training sets.

The result: the ASR attempts to transcribe these words phonetically, with results that range from the comical to the plainly unintelligible.

I recently created a personal fine-tune of Whisper.

While I had the notebook code handy, I thought it would be worth seeing whether I could fine-tune Whisper for this purpose as well. It relates to one of the most important use cases for ASR fine-tuning: adapting inherently multilingual ASR models to underrepresented languages.

Example

OpenAI Whisper Large (V3, Turbo) vs. the fine-tune, head to head.

Demo with two words from the dataset: makolet (minimarket) and teudat zehut (ID card):

TRUTH:

I went to the makolet today to pick up some bread, and I also got my teudat zehut.

FINE-TUNE:

I went to the makolet today to pick up some bread, and I also got my teudat zehut.

STOCK WHISPER:

I went to the Macaulay today to pick up some bread and I also got my Theodette Sahoot.
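To reproduce a comparison like this locally, one possible approach is the Hugging Face transformers ASR pipeline. The sketch below is illustrative only: the fine-tune repo id and the audio filename are assumptions, not part of this write-up.

```python
# Hypothetical head-to-head: stock Whisper vs. the fine-tune on one clip.
from transformers import pipeline

AUDIO = "makolet_example.wav"  # placeholder clip; substitute your own recording

stock = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
tuned = pipeline("automatic-speech-recognition", model="danielrosehill/whisper-hebrish")  # assumed repo id

print("STOCK:    ", stock(AUDIO)["text"])
print("FINE-TUNE:", tuned(AUDIO)["text"])
```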

Methodology

I used Claude Code to generate a list of 500 Hebrew words that it believed English speakers might use in daily speech. I recorded a subset of these and added words of my own as they came to mind.

I recorded three takes of each word to buttress the reliability of the fine-tune. Where common words have multiple pronunciations, I recorded each variant.

The dataset this model was trained on preserves the original audio files and the ground truths, the latter stored in a JSONL file.
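For illustration, one plausible JSONL manifest layout pairs each recorded clip with its ground-truth transcript. The field names and paths below are assumptions, not the dataset's documented schema.

```python
# Illustrative manifest writer: one JSON object per line, audio path + transcript.
import json

example_rows = [
    {"audio": "clips/makolet_01.wav", "text": "I went to the makolet today to pick up some bread."},
    {"audio": "clips/teudat_zehut_01.wav", "text": "I also got my teudat zehut."},
]

with open("hebrish_manifest.jsonl", "w", encoding="utf-8") as f:
    for row in example_rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```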

Performance & WER Metrics

The training script, written by Claude Code, was based on an excellent fine-tuning template provided by Modal.

I used an A100 GPU for the training run, which ran for 10 epochs and lasted approximately 30 minutes.
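The actual Modal-based script is not reproduced here, but the following is a minimal sketch of what a comparable fine-tuning run looks like with the Hugging Face Trainer. The manifest path, batch size, and learning rate are assumptions for illustration; only the base model and the 10-epoch count come from the description above.

```python
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

BASE_MODEL = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(BASE_MODEL, language="English", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL)

# Assumed manifest layout: one {"audio": path, "text": transcript} object per line.
ds = load_dataset("json", data_files="hebrish_manifest.jsonl", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    # Convert raw audio to log-mel input features and transcripts to token ids.
    audio = example["audio"]
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

def collate(features):
    # Pad audio features and label ids separately; mask label padding with -100.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-hebrish",
    per_device_train_batch_size=8,   # assumed; not stated in the write-up
    learning_rate=1e-5,              # assumed; not stated in the write-up
    num_train_epochs=10,             # matches the 10-epoch run described above
    fp16=True,
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ds, data_collator=collate)
trainer.train()
model.save_pretrained("whisper-hebrish")
processor.save_pretrained("whisper-hebrish")
```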

WER Improvement

| Metric | Value |
| --- | --- |
| Baseline WER (pre-training) | 16.79% |
| Post-training WER | 6.07% |
| Improvement | 63.8% reduction |

Fine-tuning the Whisper Large V3 Turbo model on English-Hebrew code-switched data resulted in a 63.8% reduction in WER, demonstrating significant improvement in transcribing mixed-language speech.
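As a hedged sketch of how WER is computed, the snippet below scores the example sentence from earlier with the jiwer library. The per-sentence numbers are illustrative only and will not match the aggregate 16.79% / 6.07% figures, which come from the full evaluation split.

```python
# Word error rate for one reference against the stock and fine-tuned outputs.
import jiwer

reference = "I went to the makolet today to pick up some bread, and I also got my teudat zehut."
stock_hyp = "I went to the Macaulay today to pick up some bread and I also got my Theodette Sahoot."
tuned_hyp = "I went to the makolet today to pick up some bread, and I also got my teudat zehut."

print(f"Stock Whisper WER: {jiwer.wer(reference, stock_hyp):.2%}")
print(f"Fine-tuned WER:    {jiwer.wer(reference, tuned_hyp):.2%}")
```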

Model created