Ollama is now powered by MLX on Apple Silicon in preview
1 min read
Originally from ollama.com
Summary
Ollama 0.18 now uses Apple’s MLX framework on Apple Silicon, delivering major speedups for local LLM inference. The update includes NVFP4 quantization support for production-parity results, smarter KV cache management for agentic workflows, and a preview launch with Qwen3.5-35B-A3B optimized for coding tasks.
Key Insights
- MLX integration produces significant prefill and decode speed improvements across all Apple Silicon chips, with M5/M5 Pro/M5 Max gaining additional GPU Neural Accelerator support
- Benchmarks used Qwen3.5-35B-A3B at NVFP4 precision; the upcoming Ollama 0.19 promises even higher throughput (1,851 tok/s prefill, 134 tok/s decode with int4)
- NVFP4 (NVIDIA’s 4-bit floating-point format) is the key quantization format here: it maintains better accuracy than traditional Q4_K_M while reducing memory bandwidth, and it matches what cloud inference providers use, so local results mirror production
- Cache improvements are specifically designed for agentic/coding use cases: cross-conversation cache reuse (critical for tools like Claude Code that share system prompts), intelligent checkpoint snapshots, and smarter eviction that preserves shared prefixes
- The ollama launch command now natively supports Claude Code and OpenClaw as first-class targets, signaling Ollama’s pivot from chat-only use to agentic infrastructure
- Requires 32+ GB of unified memory, which limits this to higher-end Mac configurations
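The bandwidth claim for NVFP4 is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes NVFP4's commonly described layout of 4-bit values in blocks of 16 with an 8-bit scale per block (an effective ~4.5 bits per weight); the exact overhead may differ slightly in practice.

```python
# Rough memory-footprint math for weight quantization.
# Assumption: NVFP4 stores 4-bit floats in blocks of 16, each block
# carrying an FP8 (E4M3) scale, giving 4 + 8/16 = 4.5 effective bits
# per weight (the small per-tensor scale is ignored here).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 35e9  # parameter count of a 35B model
fp16 = weight_bytes(N, 16)    # full-precision baseline
nvfp4 = weight_bytes(N, 4.5)  # quantized

print(f"fp16:  {fp16:.1f} GB")   # → fp16:  70.0 GB
print(f"nvfp4: {nvfp4:.1f} GB")  # → nvfp4: 19.7 GB
```

At roughly 3.6x less data to move per token, the decode-speed gains follow directly, since decode is typically memory-bandwidth bound.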
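The cross-conversation cache reuse described above boils down to prefix matching over token sequences. This is an illustrative sketch, not Ollama's actual implementation: given previously cached conversations, find the one sharing the longest token prefix with a new request, so its KV entries can be reused and only the remaining tokens need prefill.

```python
# Hypothetical sketch of cross-conversation KV-cache reuse:
# pick the cached sequence with the longest shared prefix.
from typing import Sequence

def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cached_prefix(request: Sequence[int],
                       cache: dict) -> tuple:
    """Return (cache entry id, reusable token count) for the longest match."""
    best_id, best_len = None, 0
    for entry_id, tokens in cache.items():
        n = common_prefix_len(request, tokens)
        if n > best_len:
            best_id, best_len = entry_id, n
    return best_id, best_len

# Two agent sessions sharing the same system prompt (tokens 1..5),
# as a coding agent like Claude Code would across requests:
cache = {"session-a": [1, 2, 3, 4, 5, 9, 9],
         "session-b": [1, 2, 3, 4, 5, 7]}
entry, reused = best_cached_prefix([1, 2, 3, 4, 5, 7, 8], cache)
print(entry, reused)  # → session-b 6  (only 2 new tokens need prefill)
```

Eviction that "preserves shared prefixes" follows the same idea: a prefix referenced by many sessions is the last thing you want to drop.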