Parlor - On-Device Real-Time Voice and Vision AI Using Gemma 4 and Kokoro TTS
Originally from github.com
Summary
Parlor is an open-source project that runs real-time voice + vision AI entirely on local hardware using Google's Gemma 4 E2B model and Kokoro TTS. As of April 2026, it runs in real time on an Apple M3 Pro using roughly 3 GB of RAM, with no server costs. The project grew out of a self-hosted English-learning app with hundreds of monthly active users, whose server costs the author wanted to eliminate.
Key Insights
- Hardware threshold has collapsed fast. Six months before writing, an RTX 5090 was needed for real-time voice models. Now an M3 Pro handles voice + vision together - a massive shift in what local AI can do.
- End-to-end latency is ~2.5-3.0 s: speech/vision understanding (~1.8-2.2 s), response generation at ~83 tokens/sec, and TTS (~0.3-0.7 s). Usable for conversational flow.
- Architecture is simple and portable. The browser captures mic + camera and streams them over a WebSocket to a FastAPI server, which runs Gemma 4 inference (LiteRT-LM on GPU) and Kokoro TTS (MLX on macOS, ONNX on Linux). No cloud dependency.
- Key UX features: hands-free voice activity detection (Silero VAD), mid-sentence barge-in interruption, and sentence-level TTS streaming (audio starts before the full response is generated).
- Near-future implication: if this runs on an M3 Pro today, it will run on phones within a couple of years - enabling multilingual, camera-aware AI tutors with zero marginal cost per user.
- Stack: Python 3.12+, uv, FastAPI, Gemma 4 E2B (2.6 GB), Kokoro 82M TTS. Runs on Apple Silicon and on Linux with a GPU.
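The sentence-level TTS streaming mentioned above can be sketched as follows: accumulate the model's incremental tokens into a buffer and hand each sentence to TTS as soon as closing punctuation appears, instead of waiting for the full reply. This is a minimal illustration, not Parlor's actual code; the token stream and the naive regex-based sentence splitting are assumptions.

```python
import re

# Naive sentence boundary: ., !, or ? followed by whitespace or end of buffer.
# (A real system would need smarter handling of abbreviations, numbers, etc.)
SENTENCE_END = re.compile(r"([.!?])(\s|$)")

def stream_sentences(token_iter):
    """Yield complete sentences from an incremental token stream,
    so TTS can start speaking before the full reply is generated."""
    buf = ""
    for token in token_iter:
        buf += token
        m = SENTENCE_END.search(buf)
        while m:
            sentence, buf = buf[:m.end()].strip(), buf[m.end():]
            yield sentence
            m = SENTENCE_END.search(buf)
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()

# Example: hypothetical tokens arriving one by one from the model.
tokens = ["Hel", "lo", "! ", "How can", " I help", " you today", "?"]
for sentence in stream_sentences(tokens):
    print(sentence)  # each sentence would be dispatched to TTS immediately
```

At the reported ~83 tokens/sec, a typical 40-token first sentence is ready for TTS after roughly 0.5 s, which is how the pipeline keeps total latency in the ~2.5-3.0 s range despite longer full responses.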