chenxwh / cosyvoice2-0.5b

Scalable Streaming Speech Synthesis with Large Language Models


Run time and cost

This model costs approximately $0.0071 to run on Replicate, or 140 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 8 seconds. The predict time for this model varies significantly based on the inputs.
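
For programmatic use, the model can be called through Replicate's HTTP API or official client libraries. Below is a minimal sketch with the Python client; the input field names are illustrative assumptions (check the model's API schema on Replicate for the exact parameters), and REPLICATE_API_TOKEN must be set in your environment.

    # Minimal sketch: calling the model via the Replicate Python client.
    # The input keys below are assumptions for illustration; consult the
    # model's API schema on Replicate for the actual parameter names.
    import replicate

    output = replicate.run(
        "chenxwh/cosyvoice2-0.5b",
        input={
            "tts_text": "Hello, this is a test of CosyVoice 2.",   # assumed key
            "prompt_text": "Transcript of the reference clip.",    # assumed key
            "prompt_audio": open("reference.wav", "rb"),           # assumed key
        },
    )

    # The client returns a URL or file-like handle to the synthesized audio.
    print(output)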

Readme

CosyVoice 2.0

Compared to version 1.0, CosyVoice 2.0 delivers more accurate, more stable, and faster speech generation with better overall quality.

Multilingual

  • Supported Languages: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
  • Cross-lingual & Mix-lingual: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios (see the sketch after this list).
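
Zero-shot cloning needs only a short reference clip of the target voice. A minimal sketch with the open-source CosyVoice repo (github.com/FunAudioLLM/CosyVoice), assuming the pretrained CosyVoice2-0.5B weights are in pretrained_models/ and reference.wav is a 16 kHz recording of the voice to clone:

    # Sketch: zero-shot cross-lingual cloning with the open-source repo.
    # Paths and the reference clip are placeholders.
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    # Synthesize text in a language other than the reference clip's;
    # code-switching (mixed-language) text works the same way.
    for i, out in enumerate(cosyvoice.inference_cross_lingual(
            'Hello, this sentence is spoken in the cloned voice.',
            prompt_speech_16k, stream=False)):
        torchaudio.save(f'cross_lingual_{i}.wav', out['tts_speech'],
                        cosyvoice.sample_rate)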

Ultra-Low Latency

  • Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling in a single framework.
  • Rapid First-Packet Synthesis: Achieves first-packet latency as low as 150 ms while maintaining high-quality audio output (see the streaming sketch below).
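
In the repo's Python API, streaming is exposed through the same inference calls. A sketch below, assuming the setup from the previous example: with stream=True the generator yields audio chunks as they are produced, so playback can begin well before the full utterance is synthesized.

    # Sketch: streaming zero-shot synthesis. Each yielded chunk can be
    # pushed to an audio device immediately instead of buffered to disk.
    import torch
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    chunks = []
    for out in cosyvoice.inference_zero_shot(
            'Streaming delivers the first audio packet in about 150 ms.',
            'Transcript of the reference clip.',
            prompt_speech_16k, stream=True):
        chunks.append(out['tts_speech'])   # (1, n_samples) tensor per chunk

    torchaudio.save('streamed.wav', torch.cat(chunks, dim=1),
                    cosyvoice.sample_rate)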

High Accuracy

  • Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
  • Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation benchmark.

Strong Stability

  • Timbre Consistency: Maintains a reliable, consistent voice in zero-shot and cross-lingual speech synthesis.
  • Cross-lingual Synthesis: Markedly improved over version 1.0.

Natural Experience

  • Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio raises MOS scores from 5.4 to 5.53.
  • Emotional and Dialectal Flexibility: Supports finer-grained emotional control and accent adjustment (see the instruction-control sketch below).
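
In the open-source repo, this style control is exposed through instruction-conditioned inference. A sketch below, again assuming the CosyVoice2-0.5B setup from the earlier examples; the instruction wording itself is illustrative.

    # Sketch: instruction-controlled synthesis. A natural-language
    # instruction steers emotion, style, or dialect, while the reference
    # clip still supplies the voice timbre.
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2
    from cosyvoice.utils.file_utils import load_wav

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    prompt_speech_16k = load_wav('reference.wav', 16000)

    for i, out in enumerate(cosyvoice.inference_instruct2(
            'It has been so long since we last met; I have missed you.',
            'Speak in a warm, gently excited tone.',   # emotion instruction
            prompt_speech_16k, stream=False)):
        torchaudio.save(f'instruct_{i}.wav', out['tts_speech'],
                        cosyvoice.sample_rate)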