exo: Cluster Macs to Run Frontier AI Models Locally
1 min read
Originally from github.com
View source
My notes
Summary
exo is an open-source tool that clusters multiple machines (primarily Apple Silicon Macs) into a single distributed AI inference pool, letting you run frontier-scale models like DeepSeek v3.1 671B and Kimi K2 locally. Its standout feature is day-0 support for RDMA over Thunderbolt 5 on macOS 26.2, cutting inter-device latency by 99% and unlocking practical tensor parallelism (1.8x on 2 devices, 3.2x on 4). Exposes OpenAI, Claude, OpenAI Responses, and Ollama-compatible APIs on http://localhost:52415.
Key Insight
- Frontier models run on 4x M3 Ultra Mac Studios (512 GB each = ~15 TB aggregate VRAM reference). DeepSeek v3.1 671B at 8-bit and Kimi K2 Thinking at native 4-bit are demonstrated working configurations, not theoretical.
- RDMA over Thunderbolt 5 is the unlock. Apple shipped RDMA in macOS 26.2 on any Mac with TB5 (M4 Pro mini, M4 Max Studio/MacBook, M3 Ultra Studio). Activation requires recovery-mode
rdma_ctl enable. This is the first mainstream consumer RDMA fabric and makes distributed MLX inference actually fast. - Topology-aware auto-parallel splits models across devices using a real-time view of device resources and link bandwidth/latency, no manual sharding config.
- Pipeline AND tensor parallelism both supported. Tensor parallel gives near-linear speedups where pipeline just enables bigger models. Filter via
--sharding tensorinexo-bench. - Four API compatibility layers from one server:
/v1/chat/completions(OpenAI),/v1/messages(Claude),/v1/responses(OpenAI Responses),/ollama/api/*(Ollama), drop-in for most existing clients including OpenWebUI. - Coordinator-only mode (
--no-worker) lets a low-GPU machine orchestrate a cluster without contributing compute, useful for dedicated routing nodes. - Linux support is CPU-only today. GPU acceleration for Linux is on the roadmap but not shipped. macOS is the production target.
- Caveats for RDMA: devices must be fully meshed (connected to all others), TB5 cables required, OS versions must match exactly (even beta numbers), and the TB5 port next to Ethernet on Mac Studio is not usable for RDMA.
- Cluster namespace isolation via
EXO_LIBP2P_NAMESPACEprevents accidental cross-cluster joining on shared networks, important for dev/prod separation.