Parlor - On-Device Real-Time Voice and Vision AI Using Gemma 4 and Kokoro TTS

Tags: on-device-ai, multimodal, voice-ai, gemma, local-llm, self-hosting, python, realtime
Originally from github.com

My notes

Summary

Parlor is an open-source project that runs real-time voice + vision AI entirely on local hardware, using Google’s Gemma 4 E2B model and Kokoro TTS. As of April 2026 it runs in real time on an Apple M3 Pro, using roughly 3 GB of RAM and eliminating server costs. The project grew out of a self-hosted English-learning app with hundreds of monthly active users, whose server costs the author wanted to eliminate.

Key Insight

  • Hardware threshold has collapsed fast. Six months before writing, an RTX 5090 was needed for real-time voice models. Now an M3 Pro handles voice + vision together - a massive shift in what local AI can do.
  • End-to-end latency: ~2.5-3.0 s, broken down as speech/vision understanding (~1.8-2.2 s), response generation at ~83 tokens/sec, and TTS (~0.3-0.7 s). Usable for conversational flow.
  • Architecture is simple and portable. The browser captures mic + camera and streams both over a WebSocket to a FastAPI server, which handles Gemma 4 inference (LiteRT-LM on GPU) and Kokoro TTS (MLX on Mac, ONNX on Linux). No cloud dependency.
  • Key UX features: hands-free voice activity detection (Silero VAD), barge-in interruption mid-sentence, and sentence-level TTS streaming (audio starts before the full response is generated).
  • Near-future implication: if this runs on an M3 Pro today, it will run on phones within a couple of years - enabling multilingual, camera-aware AI tutors with zero marginal cost per user.
  • Stack: Python 3.12+, uv, FastAPI, Gemma 4 E2B (2.6 GB), Kokoro 82M TTS. Works on Apple Silicon and Linux + GPU.
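The latency figures above can be sanity-checked with a back-of-envelope budget. A minimal sketch, assuming a 15-token first sentence (an illustrative number, not a figure from the project); the ranges come from the notes above:

```python
# Time until the user hears audio = understanding + generation of the
# first sentence + TTS for that sentence. Ranges are from the notes above;
# the 15-token first sentence is an assumption for illustration, since
# sentence-level streaming means we only wait for the first sentence.
UNDERSTAND_S = (1.8, 2.2)      # speech/vision understanding (s)
GEN_TOKENS_PER_S = 83          # response generation speed
FIRST_SENTENCE_TOKENS = 15     # assumed, not from the project
TTS_S = (0.3, 0.7)             # Kokoro TTS latency (s)

gen_s = FIRST_SENTENCE_TOKENS / GEN_TOKENS_PER_S   # ~0.18 s
low = UNDERSTAND_S[0] + gen_s + TTS_S[0]
high = UNDERSTAND_S[1] + gen_s + TTS_S[1]
print(f"time to first audio: ~{low:.1f}-{high:.1f} s")
```

Under these assumptions the time to first audio lands around 2.3-3.1 s, in the same ballpark as the quoted ~2.5-3.0 s end-to-end figure.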
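Sentence-level TTS streaming from the UX list can be sketched as a generator that yields each sentence to the TTS engine as soon as it completes, instead of waiting for the full LLM response. This is an illustrative sketch, not Parlor's actual chunking code:

```python
import re

def stream_sentences(token_stream):
    """Yield complete sentences as soon as they finish, so TTS can start
    speaking before the full response is generated. Sketch only: it splits
    naively on sentence-final punctuation, so abbreviations like "e.g."
    would be cut early; real chunking logic needs more care."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # A sentence ends at ., !, or ? followed by whitespace or end of buffer.
        while (m := re.search(r"[.!?](\s+|$)", buf)):
            sentence = buf[:m.end()].strip()
            buf = buf[m.end():]
            if sentence:
                yield sentence
    if buf.strip():              # flush whatever is left when the stream ends
        yield buf.strip()
```

Each yielded sentence would be handed straight to Kokoro, so the first chunk of audio plays while the model is still generating the rest of the reply.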
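Barge-in interruption can be sketched as a playback loop that checks an event flag which the VAD callback sets the moment the user starts talking. `play_chunk` and `on_speech_detected` are hypothetical stand-ins for real audio I/O and a Silero VAD callback, not Parlor's API:

```python
import threading

class BargeInPlayer:
    """Sketch of barge-in: TTS playback stops mid-sentence as soon as the
    VAD flags user speech. Names here are illustrative assumptions."""

    def __init__(self):
        self.interrupted = threading.Event()

    def on_speech_detected(self):
        # Called from the VAD side (e.g. a Silero VAD worker) when the
        # user starts speaking over the assistant.
        self.interrupted.set()

    def play(self, audio_chunks, play_chunk):
        for chunk in audio_chunks:
            if self.interrupted.is_set():   # user barged in: cut playback
                return False                # interrupted mid-response
            play_chunk(chunk)               # hand chunk to the audio device
        return True                         # played to completion
```

Checking the flag once per audio chunk keeps the cut-off latency bounded by the chunk duration, which is why TTS output is streamed in small chunks rather than one long buffer.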