Microsoft VibeVoice: Open-Source Voice AI for Long-Form Speech

Tags: voice-ai, speech-recognition, text-to-speech, open-source, microsoft, asr, real-time-tts, long-form-audio
Originally from github.com

My notes

Summary

Microsoft’s VibeVoice is an open-source family of voice AI models covering both speech-to-text (ASR) and text-to-speech (TTS), with standout long-form capabilities: 60-minute single-pass transcription with speaker diarization, and 90-minute multi-speaker speech synthesis. The ASR model is now integrated into Hugging Face Transformers, supports 50+ languages, and accepts custom hotwords for domain-specific accuracy.

Key Insight

  • 60-minute single-pass ASR is the headline feature. Most competing models chunk audio into short segments, losing speaker context. VibeVoice-ASR processes the full hour in one pass within a 64K token window, jointly producing speaker labels, timestamps, and content.
  • 7.5 Hz continuous speech tokenizers (acoustic + semantic) are the core innovation - this ultra-low frame rate dramatically cuts sequence length while preserving fidelity, making long-form processing computationally feasible.
  • Architecture: LLM backbone (Qwen2.5 1.5B base) for text/context understanding + diffusion head for high-fidelity acoustic generation. Uses “next-token diffusion” framework.
  • Model lineup:
    • VibeVoice-ASR-7B: long-form transcription with rich output (who/when/what)
    • VibeVoice-TTS-1.5B: 90-min multi-speaker (up to 4) synthesis, ICLR 2026 Oral
    • VibeVoice-Realtime-0.5B: streaming TTS, ~300ms first-audio latency, deployment-friendly size
  • Custom hotwords in ASR allow injecting domain-specific terms (names, technical jargon) to boost accuracy - practical for meeting transcription and specialised domains.
  • Multilingual: ASR covers 50+ languages; TTS currently supports English, Chinese, and others; Realtime has experimental voices in 9 languages.
  • vLLM inference support added for faster production serving.
  • Finetuning code is available for the ASR model - enables domain adaptation.
  • TTS code was removed (Sept 2025) due to misuse concerns (deepfakes), though the TTS paper and model weights remain on Hugging Face. This signals ongoing tension between open-source voice AI and responsible deployment.
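The 64K-window claim above can be sanity-checked with back-of-envelope arithmetic (my own calculation, not from the repo): at 7.5 tokens per second, an hour of speech stays well under the context limit, which is exactly why the ultra-low frame rate matters.

```python
# Why a 7.5 Hz speech tokenizer makes 60-minute single-pass ASR
# fit in a 64K token window (rough arithmetic, per token stream).
FRAME_RATE_HZ = 7.5          # speech tokens per second of audio
MINUTES = 60
CONTEXT_WINDOW = 64 * 1024   # 65,536 tokens

speech_tokens = int(MINUTES * 60 * FRAME_RATE_HZ)
headroom = CONTEXT_WINDOW - speech_tokens  # left for text, labels, timestamps

print(speech_tokens)  # 27000
print(headroom)       # 38536

# For contrast, a typical ~50 Hz neural codec would need:
codec_50hz_tokens = MINUTES * 60 * 50
print(codec_50hz_tokens)  # 180000 -- far beyond a 64K window
```

The same arithmetic explains the 90-minute TTS figure: 90 × 60 × 7.5 = 40,500 tokens, still inside the window.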
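To make the "who/when/what" bullet concrete, here is an illustrative data-structure sketch. The field names and speaker labels are my assumptions, not VibeVoice's actual output schema; the point is what a jointly diarized, timestamped transcript looks like, and why single-pass processing keeps speaker identity consistent where chunked pipelines must re-match speakers across segment boundaries.

```python
# Hypothetical shape of a diarized long-form transcript segment.
# Field names are illustrative assumptions, not the model's real schema.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str      # who
    start_s: float    # when (seconds)
    end_s: float
    text: str         # what

transcript = [
    Segment("SPEAKER_1", 0.0, 4.2, "Welcome to the meeting."),
    Segment("SPEAKER_2", 4.2, 9.8, "Thanks, let's review the agenda."),
]

def speaker_turns(segments):
    """Count speaker changes between consecutive segments."""
    return sum(1 for a, b in zip(segments, segments[1:]) if a.speaker != b.speaker)

print(speaker_turns(transcript))  # 1
```

In a one-pass model the speaker labels come from a single shared context, so "SPEAKER_1" at minute 55 is the same identity as at minute 2; chunked systems need a separate clustering step to stitch those labels back together.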