# Microsoft VibeVoice: Open-Source Voice AI for Long-Form Speech

> Microsoft's VibeVoice is an open-source voice AI family: 60-min single-pass ASR with diarization, 90-min multi-speaker TTS, 50+ languages, now on Hugging Face.

Published: 2026-04-10
URL: https://daniliants.com/insights/github---microsoftvibevoice-open-source-frontier-voice-ai/
Tags: voice-ai, speech-recognition, text-to-speech, open-source, microsoft, asr, real-time-tts, long-form-audio

---

## Summary

Microsoft's VibeVoice is an open-source family of voice AI models covering both speech-to-text (ASR) and text-to-speech (TTS). Its standout capability is long-form processing: 60-minute single-pass transcription with speaker diarization, and 90-minute multi-speaker speech synthesis. The ASR model is now integrated into Hugging Face Transformers, supports 50+ languages, and accepts custom hotwords for domain-specific accuracy.

## Key Insight

- **60-minute single-pass ASR** is the headline feature. Many competing models split audio into short chunks and lose speaker context across chunk boundaries. VibeVoice-ASR processes the full hour in one pass within a 64K token window, jointly producing speaker labels, timestamps, and content.
- **7.5 Hz continuous speech tokenizers** (acoustic + semantic) are the core innovation. At 7.5 frames per second, an hour of audio is only about 27,000 frames, so the sequence stays short enough for long-form processing to be computationally feasible while fidelity is preserved.
- **Architecture**: an LLM backbone (Qwen2.5 1.5B base in the 1.5B TTS model) handles text and context understanding, and a diffusion head generates high-fidelity acoustic output. The combination is called the "next-token diffusion" framework.
- **Model lineup**:
  - VibeVoice-ASR-7B: long-form transcription with rich output (who/when/what)
  - VibeVoice-TTS-1.5B: 90-min multi-speaker (up to 4) synthesis, ICLR 2026 Oral
  - VibeVoice-Realtime-0.5B: streaming TTS, ~300 ms first-audio latency, deployment-friendly size
- **Custom hotwords** in ASR allow injecting domain-specific terms (names, technical jargon) to boost accuracy - practical for meeting transcription and specialised domains.
- **Multilingual**: ASR covers 50+ languages; TTS currently supports English, Chinese, and others; Realtime has experimental voices in 9 languages.
- **vLLM inference support** added for faster production serving.
- **Finetuning code** is available for the ASR model - enables domain adaptation.
- **TTS code was removed** (September 2025) over misuse concerns (deepfakes), though the TTS paper and model weights remain on Hugging Face. The removal signals the ongoing tension between open-source voice AI and responsible deployment.
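
To make the 7.5 Hz claim concrete, here is a rough back-of-the-envelope sketch of how many speech frames long-form audio produces at that rate. It assumes "64K" means 65,536 tokens and that one tokenizer frame occupies one position in the context window; the 50 Hz comparison rate is a hypothetical baseline chosen for illustration, not a figure from the VibeVoice repository. Text tokens (transcript, speaker labels, timestamps) also consume context, so the real headroom is smaller than shown.

```python
# Back-of-the-envelope sequence-length arithmetic for a 7.5 Hz speech tokenizer.
# Assumptions (not from the VibeVoice docs): "64K" = 65,536 tokens, and each
# tokenizer frame occupies exactly one position in the context window.

FRAME_RATE_HZ = 7.5          # frame rate stated for the VibeVoice tokenizers
CONTEXT_WINDOW = 64 * 1024   # assumed meaning of "64K token window"
BASELINE_RATE_HZ = 50.0      # hypothetical higher-rate tokenizer, for comparison only


def frames_for(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames produced by `minutes` of audio at `rate_hz`."""
    return int(minutes * 60 * rate_hz)


asr_frames = frames_for(60)                          # one hour of ASR input
tts_frames = frames_for(90)                          # 90 minutes of synthesized speech
baseline_frames = frames_for(60, BASELINE_RATE_HZ)   # one hour at the baseline rate

if __name__ == "__main__":
    print(f"60 min @ 7.5 Hz: {asr_frames:,} frames "
          f"({asr_frames / CONTEXT_WINDOW:.0%} of the assumed window)")
    print(f"90 min @ 7.5 Hz: {tts_frames:,} frames")
    print(f"60 min @ {BASELINE_RATE_HZ:g} Hz: {baseline_frames:,} frames "
          f"(fits in window: {baseline_frames <= CONTEXT_WINDOW})")
```

Under these assumptions an hour of audio is 27,000 frames and 90 minutes is 40,500 frames, both well below 65,536, whereas the hypothetical 50 Hz tokenizer would need 180,000 frames for the same hour and would force chunking.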