Mega-ASR: open-source ASR built for noisy real-world audio

Tags: speech-recognition, open-source, transcription, whisper
Originally from github.com

My notes

Summary

Mega-ASR is an open-source (Apache-2.0) foundation speech-recognition model built specifically for messy real-world audio: noise, far-field capture, reverb, recording artifacts, transmission dropout. It is trained on 7 “atomic” acoustic conditions plus 54 compound scenarios (~2.4M samples) on top of a Qwen3-ASR backbone, and claims WER improvements of up to ~30% over leading open and closed SOTA models on audio where those models collapse.
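
Since every claim here is stated in WER, a minimal sketch of the metric helps calibrate the numbers. This is the textbook definition (word-level edit distance divided by reference length), not the project's own scorer; the function name `wer` and the sample sentences are illustrative assumptions, and real evaluations usually add text normalization (casing, punctuation) that is omitted here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or free match when words agree
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "turn the volume down please"   # illustrative sentence, not from the benchmark
print(wer(reference, reference))             # perfect transcript
print(wer(reference, ""))                    # empty transcript
print(wer(reference, "turn the volume up"))  # one substitution plus one deletion
```

An empty transcript counts every reference word as a deletion, so it scores 1.0 (100 when reported as a percentage). Because insertions also count, WER can exceed 100 on hallucinated output.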

Key Insight

  • The pitch is narrow but sharp: not “best ASR overall” - best when audio is degraded and other models hallucinate, return empty output, or drop utterances. Clean-audio accuracy is admitted to be slightly worse.
  • Robustness comes from a deliberate data taxonomy: 7 atomic conditions (noise, far-field, obstruction, echo/reverb, recording artifacts, electronic distortion, transmission dropout) combined into 54 compound scenarios - the differentiator is systematic acoustic simulation, not just more data.
  • A LoRA-router architecture addresses the clean-audio tradeoff: training on inherently high-WER data degrades basic recognition, so a router decides per input whether to mount the Mega-ASR LoRA delta on the base model. You get robustness only when you need it.
  • Two-stage training recipe:
    • A2S-SFT (acoustic-to-semantic progressive SFT): curriculum from WER<30% to <50% to <70%, training encoder+aligner first, then LLM on hard data for semantic recovery, then joint end-to-end fine-tune.
    • DG-WGPO RL: WER-gated policy learning - low-WER samples get token-level acoustic refinement, high-WER samples get sentence-level semantic reconstruction; reward = static WER accuracy + anti-repetition gate + dynamic dual-granularity reward (τ=0.3, α_s=0.4, α_dyn=0.6). RL code not yet released.
  • Benchmark drama in the README: on one degraded sample, Mega-ASR hit 47.1 WER while Qwen3-ASR returned empty output (100 WER), Gemini-3-Pro scored 86.1, Seed-ASR 85.3, and Whisper 92.5. On another, Mega-ASR scored 5.9 WER vs 64.7 for both Qwen3-ASR and Gemini-3-Pro.
  • Deployment-ready extras: vLLM streaming inference (materializes LoRA into a checkpoint, drops per-sample routing), long-form streaming with periodic state reset to bound memory, conservative 8GB-GPU defaults. Built on Qwen3-ASR-1.7B.
  • Fully open: model weights, training code, Voices-in-the-Wild-2M dataset, and Voices-in-the-Wild-Bench all released on HuggingFace/GitHub under Apache-2.0.
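
To make the LoRA-router idea concrete, here is a minimal sketch of per-input adapter mounting. It uses the standard LoRA merge, W + (alpha / r) · B·A, on a toy 2×2 weight. The router itself is a stand-in: `degradation_score`, `threshold`, and `select_weights` are hypothetical names, and the notes do not say how Mega-ASR's router actually makes its decision.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, the standard LoRA merged weight."""
    rank = len(lora_a)  # A has shape (r, in_features), B has shape (out_features, r)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w_val + scale * d_val for w_val, d_val in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def select_weights(
    degradation_score: float,  # hypothetical router output in [0, 1]
    w: Matrix,
    lora_b: Matrix,
    lora_a: Matrix,
    alpha: float = 1.0,
    threshold: float = 0.5,    # hypothetical cutoff, not a value from the project
) -> Matrix:
    """Mount the LoRA delta only when the input looks degraded."""
    if degradation_score >= threshold:
        return apply_lora(w, lora_b, lora_a, alpha)
    return w  # clean audio: base weights untouched


base = [[1.0, 0.0], [0.0, 1.0]]
b_mat = [[1.0], [2.0]]   # out_features x r, with r = 1
a_mat = [[0.5, 0.0]]     # r x in_features

print(select_weights(0.1, base, b_mat, a_mat))  # clean input: base weights
print(select_weights(0.9, base, b_mat, a_mat))  # degraded input: merged weights
```

In these terms, the vLLM path described above amounts to always taking the merged branch: the delta is baked into a checkpoint once, and the per-sample decision goes away.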