Ollama is now powered by MLX on Apple Silicon in preview
1 min read
Originally from ollama.com
Summary
Ollama 0.18 now uses Apple’s MLX framework on Apple Silicon, delivering major speedups for local LLM inference. The update includes NVFP4 quantization support for production-parity results, smarter KV cache management for agentic workflows, and a preview launch with Qwen3.5-35B-A3B optimized for coding tasks.
Key Insights
- MLX integration produces significant prefill and decode speed improvements across all Apple Silicon chips, with M5/M5 Pro/M5 Max gaining additional GPU Neural Accelerator support
- Benchmarks used Qwen3.5-35B-A3B at NVFP4 precision; the upcoming Ollama 0.19 promises even higher throughput (1,851 tok/s prefill, 134 tok/s decode with int4)
- NVFP4 (NVIDIA’s 4-bit floating-point format) is the key quantization format here: it maintains better accuracy than traditional Q4_K_M while reducing memory bandwidth, and it matches what cloud inference providers use, so local results mirror production
- Cache improvements are specifically designed for agentic/coding use cases: cross-conversation cache reuse (critical for tools like Claude Code that share system prompts), intelligent checkpoint snapshots, and smarter eviction that preserves shared prefixes
- The ollama launch command now natively supports Claude Code and OpenClaw as first-class targets, signaling Ollama’s pivot from chat-only use to agentic infrastructure
- Requires 32+ GB of unified memory, which limits this to higher-end Mac configurations
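The bandwidth claim for NVFP4 is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes NVFP4's commonly described layout of 4-bit values in blocks of 16 with an 8-bit scale per block (an effective ~4.5 bits per weight); the exact overhead may differ slightly in practice.

```python
# Rough memory-footprint math for weight quantization.
# Assumption: NVFP4 stores 4-bit floats in blocks of 16, each block
# carrying an FP8 (E4M3) scale, giving 4 + 8/16 = 4.5 effective bits
# per weight (the small per-tensor scale is ignored here).

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 35e9  # parameter count of a 35B model
fp16 = weight_bytes(N, 16)    # full-precision baseline
nvfp4 = weight_bytes(N, 4.5)  # quantized

print(f"fp16:  {fp16:.1f} GB")   # → fp16:  70.0 GB
print(f"nvfp4: {nvfp4:.1f} GB")  # → nvfp4: 19.7 GB
```

At roughly 3.6x less data to move per token, the decode-speed gains follow directly, since decode is typically memory-bandwidth bound.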
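The cross-conversation cache reuse described above boils down to prefix matching over token sequences. This is an illustrative sketch, not Ollama's actual implementation: given previously cached conversations, find the one sharing the longest token prefix with a new request, so its KV entries can be reused and only the remaining tokens need prefill.

```python
# Hypothetical sketch of cross-conversation KV-cache reuse:
# pick the cached sequence with the longest shared prefix.
from typing import Sequence

def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cached_prefix(request: Sequence[int],
                       cache: dict) -> tuple:
    """Return (cache entry id, reusable token count) for the longest match."""
    best_id, best_len = None, 0
    for entry_id, tokens in cache.items():
        n = common_prefix_len(request, tokens)
        if n > best_len:
            best_id, best_len = entry_id, n
    return best_id, best_len

# Two agent sessions sharing the same system prompt (tokens 1..5),
# as a coding agent like Claude Code would across requests:
cache = {"session-a": [1, 2, 3, 4, 5, 9, 9],
         "session-b": [1, 2, 3, 4, 5, 7]}
entry, reused = best_cached_prefix([1, 2, 3, 4, 5, 7, 8], cache)
print(entry, reused)  # → session-b 6  (only 2 new tokens need prefill)
```

Eviction that "preserves shared prefixes" follows the same idea: a prefix referenced by many sessions is the last thing you want to drop.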