zsxkib / hibiki

Hibiki: High-Fidelity Simultaneous Speech-To-Speech Translation

  • Public
  • 12 runs
  • L40S
  • GitHub
  • Weights
  • Paper
  • License

Input

  • file - Input audio file to translate
  • file - Optional input video file

Including volume_reduction_db and 2 more...

Output


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Hibiki: Real-Time Voice-Preserving Translation

[Paper] | [Hear Samples] | [Model Weights]

Hibiki delivers real-time speech translation while preserving the speaker’s voice characteristics. Designed for seamless French→English conversion, it operates locally on consumer hardware with natural-sounding results.

Why Hibiki?

  • 🎭 Voice Preservation - Maintains speaker’s vocal identity using advanced guidance techniques
  • ⚡ Instant Translation - Processes audio at 12.5 frames per second (one 80 ms frame at a time) for real-time conversion
  • 🔊 Natural Output - Generates fluent target speech with human-like prosody
  • 📝 Dual Output - Produces both translated speech and text simultaneously

Quick Translation

Run with Cog using our sample file:

sudo cog predict -i audio_input=@examples/sample_fr_hibiki_crepes.mp3

Use your own .mp3 file for custom translations.
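
You can also call the hosted model from Python with the official replicate client. The snippet below is a minimal sketch: it assumes the input name audio_input shown in the Cog example above and a local file named my_french_audio.mp3; the exact return type (URL string or file-like object) depends on your client version.

import replicate

# Minimal sketch: run the hosted model from Python.
# "zsxkib/hibiki" is the model shown on this page; audio_input matches
# the Cog example above. Other inputs (e.g. volume_reduction_db) are
# optional and left at their defaults here.
output = replicate.run(
    "zsxkib/hibiki",
    input={"audio_input": open("my_french_audio.mp3", "rb")},
)

# Depending on the client version, output may be a URL or a file-like
# object pointing at the translated English audio.
print(output)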

Supported Languages

Currently supports French → English translation. More languages coming soon.

📄 Citation
If you use Hibiki in your research, please cite our paper.

Model weights are licensed under CC-BY 4.0.
Inference code is MIT-licensed.


Maintained by @zsxkib for Replicate integration (follow me on X/Twitter for updates)!