Microsoft VibeVoice: Open-Source Voice AI for Long-Form Speech
Originally from github.com
My notes
Summary
Microsoft’s VibeVoice is an open-source family of voice AI models covering both speech-to-text (ASR) and text-to-speech (TTS). Its standout capability is long-form processing: 60-minute single-pass transcription with speaker diarization, and 90-minute multi-speaker speech synthesis. The ASR model is now integrated into Hugging Face Transformers, supports 50+ languages, and accepts custom hotwords for domain-specific accuracy.
Key Insights
- 60-minute single-pass ASR is the headline feature. Most competing models chunk audio into short segments, losing speaker context. VibeVoice-ASR processes the full hour in one pass within a 64K token window, jointly producing speaker labels, timestamps, and content.
- 7.5 Hz continuous speech tokenizers (acoustic + semantic) are the core innovation. At 7.5 frames per second, an hour of audio is only 27,000 frames, so sequences stay short enough to make long-form processing computationally feasible while preserving fidelity.
- Architecture: LLM backbone (Qwen2.5 1.5B base) for text/context understanding + diffusion head for high-fidelity acoustic generation. It uses a “next-token diffusion” framework.
- Model lineup:
  - VibeVoice-ASR-7B: long-form transcription with rich output (who/when/what)
  - VibeVoice-TTS-1.5B: 90-minute multi-speaker (up to 4) synthesis, ICLR 2026 Oral
  - VibeVoice-Realtime-0.5B: streaming TTS, ~300 ms first-audio latency, deployment-friendly size
- Custom hotwords in ASR allow injecting domain-specific terms (names, technical jargon) to boost accuracy, which is practical for meeting transcription and specialised domains.
- Multilingual: ASR covers 50+ languages; TTS currently supports English, Chinese, and others; Realtime has experimental voices in 9 languages.
- vLLM inference support added for faster production serving.
- Finetuning code is available for the ASR model, which enables domain adaptation.
- The TTS code was removed in September 2025 over misuse concerns (deepfakes), though the TTS paper and model weights remain on Hugging Face. This signals ongoing tension between open-source voice AI and responsible deployment.
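
To sanity-check the long-form claims above, here is a minimal sketch of the sequence-length arithmetic behind the 7.5 Hz frame rate. It assumes one context position per speech frame and reads "64K" as 65,536 tokens. Both are my assumptions, not details confirmed by the source. The 50 Hz rate is only an illustrative stand-in for a higher-frame-rate tokenizer, and the function names are hypothetical. Text tokens (prompt, transcript, speaker labels) also use part of the window and are not counted here.

```python
import math

# Values taken from the notes above.
FRAME_RATE_HZ = 7.5          # VibeVoice speech tokenizer frame rate
# Assumption: "64K" is read as 64 * 1024 context positions.
CONTEXT_WINDOW = 64 * 1024


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed to cover `minutes` of audio."""
    return math.ceil(minutes * 60 * frame_rate_hz)


def fits_in_window(minutes: float,
                   frame_rate_hz: float = FRAME_RATE_HZ,
                   window: int = CONTEXT_WINDOW) -> bool:
    """True if the speech frames alone fit in the context window.

    Assumption: one frame occupies one context position.
    """
    return speech_frames(minutes, frame_rate_hz) <= window


if __name__ == "__main__":
    # 60-minute ASR input at 7.5 Hz
    print(speech_frames(60), fits_in_window(60))          # 27000 True
    # 90-minute TTS output at 7.5 Hz
    print(speech_frames(90), fits_in_window(90))          # 40500 True
    # Same hour at a hypothetical 50 Hz tokenizer (illustrative only)
    print(speech_frames(60, 50), fits_in_window(60, 50))  # 180000 False
```

Under these assumptions, an hour of audio takes well under half of the window, which leaves room for the transcript itself. At the illustrative 50 Hz rate the same hour would need roughly three times the window, which is why chunking is the usual fallback.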