Parlor - On-Device Real-Time Voice and Vision AI Using Gemma 4 and Kokoro TTS

Tags: on-device-ai, multimodal, voice-ai, gemma, local-llm, self-hosting, python, realtime
Originally from github.com

My notes

Summary

Parlor is an open-source project that runs real-time voice + vision AI entirely on local hardware, using Google’s Gemma 4 E2B model and Kokoro TTS. As of April 2026 it runs in real time on an Apple M3 Pro, using roughly 3 GB of RAM and eliminating server costs. The project grew out of a self-hosted English-learning app with hundreds of monthly active users, whose server costs the author wanted to eliminate.

Key Insight

  • Hardware threshold has collapsed fast. Six months before writing, an RTX 5090 was needed for real-time voice models. Now an M3 Pro handles voice + vision together - a massive shift in what local AI can do.
  • End-to-end latency: ~2.5-3.0 s, broken down as speech/vision understanding (~1.8-2.2 s), response generation at ~83 tokens/sec, and TTS (~0.3-0.7 s). Usable for conversational flow.
  • Architecture is simple and portable. The browser captures mic + camera and streams both over a WebSocket to a FastAPI server, which handles Gemma 4 inference (LiteRT-LM on GPU) and Kokoro TTS (MLX on Mac, ONNX on Linux). No cloud dependency.
  • Key UX features: hands-free voice activity detection (Silero VAD), barge-in interruption mid-sentence, and sentence-level TTS streaming (audio starts before the full response is generated).
  • Near-future implication: if this runs on an M3 Pro today, it will run on phones within a couple of years - enabling multilingual, camera-aware AI tutors with zero marginal cost per user.
  • Stack: Python 3.12+, uv, FastAPI, Gemma 4 E2B (2.6 GB), Kokoro 82M TTS. Works on Apple Silicon and Linux + GPU.
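The latency figures above can be sanity-checked with a back-of-envelope budget. A minimal sketch, assuming a 15-token first sentence (an illustrative number, not a figure from the project); the ranges come from the notes above:

```python
# Time until the user hears audio = understanding + generation of the
# first sentence + TTS for that sentence. Ranges are from the notes above;
# the 15-token first sentence is an assumption for illustration, since
# sentence-level streaming means we only wait for the first sentence.
UNDERSTAND_S = (1.8, 2.2)      # speech/vision understanding (s)
GEN_TOKENS_PER_S = 83          # response generation speed
FIRST_SENTENCE_TOKENS = 15     # assumed, not from the project
TTS_S = (0.3, 0.7)             # Kokoro TTS latency (s)

gen_s = FIRST_SENTENCE_TOKENS / GEN_TOKENS_PER_S   # ~0.18 s
low = UNDERSTAND_S[0] + gen_s + TTS_S[0]
high = UNDERSTAND_S[1] + gen_s + TTS_S[1]
print(f"time to first audio: ~{low:.1f}-{high:.1f} s")
```

Under these assumptions the time to first audio lands around 2.3-3.1 s, in the same ballpark as the quoted ~2.5-3.0 s end-to-end figure.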
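Sentence-level TTS streaming from the UX list can be sketched as a generator that yields each sentence to the TTS engine as soon as it completes, instead of waiting for the full LLM response. This is an illustrative sketch, not Parlor's actual chunking code:

```python
import re

def stream_sentences(token_stream):
    """Yield complete sentences as soon as they finish, so TTS can start
    speaking before the full response is generated. Sketch only: it splits
    naively on sentence-final punctuation, so abbreviations like "e.g."
    would be cut early; real chunking logic needs more care."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # A sentence ends at ., !, or ? followed by whitespace or end of buffer.
        while (m := re.search(r"[.!?](\s+|$)", buf)):
            sentence = buf[:m.end()].strip()
            buf = buf[m.end():]
            if sentence:
                yield sentence
    if buf.strip():              # flush whatever is left when the stream ends
        yield buf.strip()
```

Each yielded sentence would be handed straight to Kokoro, so the first chunk of audio plays while the model is still generating the rest of the reply.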
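Barge-in interruption can be sketched as a playback loop that checks an event flag which the VAD callback sets the moment the user starts talking. `play_chunk` and `on_speech_detected` are hypothetical stand-ins for real audio I/O and a Silero VAD callback, not Parlor's API:

```python
import threading

class BargeInPlayer:
    """Sketch of barge-in: TTS playback stops mid-sentence as soon as the
    VAD flags user speech. Names here are illustrative assumptions."""

    def __init__(self):
        self.interrupted = threading.Event()

    def on_speech_detected(self):
        # Called from the VAD side (e.g. a Silero VAD worker) when the
        # user starts speaking over the assistant.
        self.interrupted.set()

    def play(self, audio_chunks, play_chunk):
        for chunk in audio_chunks:
            if self.interrupted.is_set():   # user barged in: cut playback
                return False                # interrupted mid-response
            play_chunk(chunk)               # hand chunk to the audio device
        return True                         # played to completion
```

Checking the flag once per audio chunk keeps the cut-off latency bounded by the chunk duration, which is why TTS output is streamed in small chunks rather than one long buffer.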