chenxwh/cosyvoice2-0.5b

Scalable Streaming Speech Synthesis with Large Language Models

CosyVoice 2.0

Compared to version 1.0, CosyVoice 2.0 generates speech more accurately, more stably, and faster, with better overall quality.

Multilingual

  • Supported Languages: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
  • Cross-lingual & Mixed-lingual: Supports zero-shot voice cloning in cross-lingual and code-switching scenarios.

Ultra-Low Latency

  • Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling technologies.
  • Rapid First Packet Synthesis: Achieves first-packet latency as low as 150 ms while maintaining high-quality audio output.
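
First-packet latency is the time between starting a streaming request and receiving the first audio chunk. The sketch below shows one way this could be measured for any chunk iterator. It is illustrative only: `fake_streaming_tts` is a hypothetical stand-in that yields dummy bytes, not the CosyVoice API, and `first_packet_latency` is a helper name introduced here.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_chars: int = 4) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (NOT the CosyVoice API).

    Yields one dummy "audio" chunk per group of `chunk_chars` characters.
    """
    for start in range(0, len(text), chunk_chars):
        yield text[start:start + chunk_chars].encode("utf-8")


def first_packet_latency(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = clock()
    first_latency = None
    chunk_count = 0
    for _chunk in stream:
        if first_latency is None:
            # The first chunk has just arrived: record the elapsed time once.
            first_latency = clock() - start
        chunk_count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, chunk_count


if __name__ == "__main__":
    latency, chunks = first_packet_latency(fake_streaming_tts("hello world"))
    print(f"first packet after {latency * 1000:.3f} ms, {chunks} chunks in total")
```

The `clock` parameter is injectable so the measurement can be tested deterministically. With a real streaming client, the timer should start immediately before the request is issued.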

High Accuracy

  • Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
  • Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
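
For readers unfamiliar with the metric, character error rate (CER) is commonly computed as the character-level edit distance between a reference transcript and the transcript recognized from the synthesized audio, divided by the reference length. The following is a minimal sketch under that common definition; it is not the Seed-TTS evaluation code, and the function names and sample numbers are assumptions made for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which an error rate dropped, e.g. 0.4 means 40% fewer errors."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate


if __name__ == "__main__":
    # Made-up transcripts and rates, purely for illustration.
    print(character_error_rate("hello", "hallo"))
    print(relative_reduction(0.10, 0.06))
```

Under this definition, a "30% to 50% reduction" means the new error rate is between 0.5 and 0.7 times the old one.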

Strong Stability

  • Consistency in Timbre: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
  • Cross-language Synthesis: Marked improvements compared to version 1.0.

Natural Experience

  • Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio, raising mean opinion score (MOS) evaluation results from 5.4 to 5.53.
  • Emotional and Dialectal Flexibility: Now supports more granular emotional controls and accent adjustments.