# Mega-ASR: open-source ASR built for noisy real-world audio

> Open-source (Apache-2.0) foundation speech-recognition model built for messy audio, claiming up to ~30% WER gains over SOTA where other models collapse.

Published: 2026-06-18
URL: https://daniliants.com/insights/mega-asr-open-source-asr-built-for-noisy-real-world-audio/
Tags: speech-recognition, open-source, transcription, whisper

---

## Summary

Mega-ASR is an open-source (Apache-2.0) foundation speech-recognition model built specifically for messy real-world audio - noise, far-field, reverb, recording artifacts, transmission dropout. It is trained on 7 "atomic" acoustic conditions plus 54 compound scenarios (~2.4M samples) on a Qwen3-ASR backbone, and claims up to ~30% WER gains over leading open and closed SOTA models where they collapse.
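
For readers new to the metric, WER (word error rate) is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a generic illustration, not the project's scoring script; the function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # ASSUMPTION: simple normalization (lowercase, whitespace split).
    # Real ASR scoring pipelines usually normalize more aggressively.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

An empty transcript counts every reference word as a deletion, so it scores 1.0 (100%). Because insertions are counted too, WER can exceed 100% when a model hallucinates extra words.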

## Key Insight

- **The pitch is narrow but sharp**: not "best ASR overall" - best when audio is degraded and other models hallucinate, return empty output, or drop utterances. Clean-audio accuracy is admitted to be slightly worse.
- **Robustness comes from a deliberate data taxonomy**: 7 atomic conditions (noise, far-field, obstruction, echo/reverb, recording artifacts, electronic distortion, transmission dropout) combined into 54 compound scenarios - the differentiator is systematic acoustic simulation, not just more data.
- **LoRA-router architecture addresses the clean-audio tradeoff**: training on inherently high-WER data degrades recognition of clean speech, so a router decides per input whether to mount the Mega-ASR LoRA delta on the base model. Robustness is applied only when the input needs it.
- **Two-stage training recipe**:
  - **A2S-SFT** (acoustic-to-semantic progressive SFT): curriculum from WER<30% to <50% to <70%, training encoder+aligner first, then LLM on hard data for semantic recovery, then joint end-to-end fine-tune.
  - **DG-WGPO RL**: WER-gated policy learning - low-WER samples get token-level acoustic refinement, high-WER samples get sentence-level semantic reconstruction; reward = static WER accuracy + anti-repetition gate + dynamic dual-granularity reward (τ=0.3, α_s=0.4, α_dyn=0.6). RL code not yet released.
- **Benchmark drama in the README**: on one degraded sample, Mega-ASR scored 47.1% WER while Qwen3-ASR returned empty output (100% WER), Gemini-3-Pro scored 86.1%, Seed-ASR 85.3%, and Whisper 92.5%. On another sample, Mega-ASR scored 5.9% WER vs 64.7% for both Qwen3-ASR and Gemini-3-Pro.
- **Deployment-ready extras**: vLLM streaming inference (materializes LoRA into a checkpoint, drops per-sample routing), long-form streaming with periodic state reset to bound memory, conservative 8GB-GPU defaults. Built on Qwen3-ASR-1.7B.
- **Open release**: model weights, training code (except the RL stage noted above), the Voices-in-the-Wild-2M dataset, and Voices-in-the-Wild-Bench are all released on HuggingFace/GitHub under Apache-2.0.
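
The WER thresholds in the two-stage recipe above can be sketched as simple bucketing logic. This is an illustration under stated assumptions, not the released training code: the function names are hypothetical, samples at or above 70% WER are assumed to fall outside the curriculum, and the 0.3 value listed as tau is assumed to be the WER gate (the summary does not say what tau controls).

```python
from typing import Optional

# Curriculum ceilings quoted in the A2S-SFT bullet: WER < 30%, < 50%, < 70%.
STAGE_CEILINGS = (0.30, 0.50, 0.70)

# ASSUMPTION: 0.3 is used here as the WER gate. The source lists tau=0.3
# without stating what it controls.
ASSUMED_WER_GATE = 0.30


def curriculum_stage(sample_wer: float) -> Optional[int]:
    """Return the first stage (1 to 3) whose WER ceiling admits the sample.

    Returns None for samples at or above the last ceiling (a hypothetical
    choice: such samples are left out of the curriculum).
    """
    for stage, ceiling in enumerate(STAGE_CEILINGS, start=1):
        if sample_wer < ceiling:
            return stage
    return None


def reward_granularity(sample_wer: float, gate: float = ASSUMED_WER_GATE) -> str:
    """Pick the reward granularity described in the DG-WGPO bullet.

    Low-WER samples get token-level refinement; high-WER samples get
    sentence-level reconstruction.
    """
    return "token" if sample_wer < gate else "sentence"


if __name__ == "__main__":
    # WER values are fractions (0.41 means 41%), chosen only for illustration.
    for wer in (0.12, 0.41, 0.66, 0.93):
        print(wer, curriculum_stage(wer), reward_granularity(wer))
```

Both functions take the WER of a sample as measured by some reference model. Which model produces that estimate, and whether later curriculum stages also replay easier samples, is not specified in the summary.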