chenxwh/cosyvoice2-0.5b | Readme and Docs

Scalable Streaming Speech Synthesis with Large Language Models

CosyVoice 2.0

Compared to version 1.0, the new version generates speech that is more accurate, more stable, and faster to produce, with better overall quality.

Multilingual

Supported languages: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
Cross-lingual and mixed-lingual: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios.

Ultra-Low Latency

Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling technologies.
Rapid First Packet Synthesis: Achieves first-packet latency as low as 150 ms while maintaining high-quality audio output.
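
First-packet latency is the time between requesting synthesis and receiving the first chunk of audio. The sketch below shows one way to measure it for any chunked audio source. It does not call CosyVoice: `fake_stream` is a hypothetical stand-in that sleeps and yields silent bytes, and `first_packet_latency` is an illustrative helper name, not part of any library.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk).

    The timer starts when we begin pulling from the source. For a lazy
    generator, that covers all the work done up to its first yield.
    """
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (assumption)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # pretend to synthesize one chunk
        yield b"\x00" * 320   # 320 bytes of silence as placeholder audio


latency, chunk = first_packet_latency(fake_stream())
print(f"first packet after {latency * 1000:.0f} ms, {len(chunk)} bytes")
```

To time a real streaming endpoint, pass its chunk iterator in place of `fake_stream()`. The measured value then includes network and queueing time, so it may be higher than the model-side figure quoted above.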

High Accuracy

Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
Benchmark Achievements: Attains the lowest character error rate on the hard subset of the Seed-TTS evaluation set.
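
For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The following is a generic sketch of that definition, not the Seed-TTS evaluation script. The function names are illustrative, and real evaluations usually normalize text (punctuation, casing, spacing) first.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of six gives a CER of 1/6.
print(character_error_rate("今天天气很好", "今天天汽很好"))
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so a lower CER suggests fewer pronunciation errors.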

Strong Stability

Consistency in Timbre: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
Cross-language Synthesis: Marked improvements compared to version 1.0.

Natural Experience

Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
Emotional and Dialectal Flexibility: Supports finer-grained emotional control and accent adjustment.
