Microsoft VibeVoice: Open-Source Voice AI for Long-Form Speech

Tags: voice-ai, speech-recognition, text-to-speech, open-source, microsoft, asr, real-time-tts, long-form-audio
Originally from github.com

My notes

Summary

Microsoft’s VibeVoice is an open-source family of voice AI models covering both speech-to-text (ASR) and text-to-speech (TTS), with standout long-form capabilities: 60-minute single-pass transcription with speaker diarization, and 90-minute multi-speaker speech synthesis. The ASR model is now integrated into Hugging Face Transformers, supports 50+ languages, and accepts custom hotwords for domain-specific accuracy.

Key Insight

  • 60-minute single-pass ASR is the headline feature. Most competing models chunk audio into short segments, losing speaker context. VibeVoice-ASR processes the full hour in one pass within a 64K token window, jointly producing speaker labels, timestamps, and content.
  • 7.5 Hz continuous speech tokenizers (acoustic + semantic) are the core innovation - this ultra-low frame rate dramatically cuts sequence length while preserving fidelity, making long-form processing computationally feasible.
  • Architecture: LLM backbone (Qwen2.5 1.5B base) for text/context understanding + diffusion head for high-fidelity acoustic generation. Uses “next-token diffusion” framework.
  • Model lineup:
    • VibeVoice-ASR-7B: long-form transcription with rich output (who/when/what)
    • VibeVoice-TTS-1.5B: 90-min multi-speaker (up to 4) synthesis, ICLR 2026 Oral
    • VibeVoice-Realtime-0.5B: streaming TTS, ~300ms first-audio latency, deployment-friendly size
  • Custom hotwords in ASR allow injecting domain-specific terms (names, technical jargon) to boost accuracy - practical for meeting transcription and specialised domains.
  • Multilingual: ASR covers 50+ languages; TTS currently supports English, Chinese, and others; Realtime has experimental voices in 9 languages.
  • vLLM inference support added for faster production serving.
  • Finetuning code is available for the ASR model - enables domain adaptation.
  • TTS code was removed (Sept 2025) due to misuse concerns (deepfakes), though the TTS paper and model weights remain on Hugging Face. This signals ongoing tension between open-source voice AI and responsible deployment.
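The 64K-window claim above can be sanity-checked with back-of-envelope arithmetic (my own calculation, not from the repo): at 7.5 tokens per second, an hour of speech stays well under the context limit, which is exactly why the ultra-low frame rate matters.

```python
# Why a 7.5 Hz speech tokenizer makes 60-minute single-pass ASR
# fit in a 64K token window (rough arithmetic, per token stream).
FRAME_RATE_HZ = 7.5          # speech tokens per second of audio
MINUTES = 60
CONTEXT_WINDOW = 64 * 1024   # 65,536 tokens

speech_tokens = int(MINUTES * 60 * FRAME_RATE_HZ)
headroom = CONTEXT_WINDOW - speech_tokens  # left for text, labels, timestamps

print(speech_tokens)  # 27000
print(headroom)       # 38536

# For contrast, a typical ~50 Hz neural codec would need:
codec_50hz_tokens = MINUTES * 60 * 50
print(codec_50hz_tokens)  # 180000 -- far beyond a 64K window
```

The same arithmetic explains the 90-minute TTS figure: 90 × 60 × 7.5 = 40,500 tokens, still inside the window.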
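To make the "who/when/what" bullet concrete, here is an illustrative data-structure sketch. The field names and speaker labels are my assumptions, not VibeVoice's actual output schema; the point is what a jointly diarized, timestamped transcript looks like, and why single-pass processing keeps speaker identity consistent where chunked pipelines must re-match speakers across segment boundaries.

```python
# Hypothetical shape of a diarized long-form transcript segment.
# Field names are illustrative assumptions, not the model's real schema.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str      # who
    start_s: float    # when (seconds)
    end_s: float
    text: str         # what

transcript = [
    Segment("SPEAKER_1", 0.0, 4.2, "Welcome to the meeting."),
    Segment("SPEAKER_2", 4.2, 9.8, "Thanks, let's review the agenda."),
]

def speaker_turns(segments):
    """Count speaker changes between consecutive segments."""
    return sum(1 for a, b in zip(segments, segments[1:]) if a.speaker != b.speaker)

print(speaker_turns(transcript))  # 1
```

In a one-pass model the speaker labels come from a single shared context, so "SPEAKER_1" at minute 55 is the same identity as at minute 2; chunked systems need a separate clustering step to stitch those labels back together.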