# Ollama is now powered by MLX on Apple Silicon in preview

> Ollama 0.18 now uses Apple MLX on Apple Silicon for faster local LLM inference, with NVFP4 quantization, better KV cache, and Qwen3.5-35B-A3B in preview.

Published: 2026-03-31
URL: https://daniliants.com/insights/ollama-is-now-powered-by-mlx-on-apple-silicon-in-preview/
Tags: ollama, mlx, apple-silicon, local-llm, nvfp4, quantization, coding-agents, inference-performance

---

## Summary

Ollama 0.18 now uses Apple's MLX framework on Apple Silicon, delivering major speedups for local LLM inference. The update includes NVFP4 quantization support for production-parity results, smarter KV cache management for agentic workflows, and a preview launch with Qwen3.5-35B-A3B optimized for coding tasks.

## Key Insights

- MLX integration delivers significant prefill and decode speedups across all Apple Silicon generations, with M5/M5 Pro/M5 Max chips gaining additional GPU Neural Accelerator support
- Benchmarks used Qwen3.5-35B-A3B at NVFP4 precision; the upcoming Ollama 0.19 promises even higher throughput (1,851 tok/s prefill, 134 tok/s decode with int4)
- NVFP4 (NVIDIA's 4-bit floating-point format) is the key quantization format here: it maintains better accuracy than traditional Q4_K_M while reducing memory bandwidth, and it matches what cloud inference providers use, so local results mirror production behavior
- Cache improvements are specifically designed for agentic/coding use cases: cross-conversation cache reuse (critical for tools like Claude Code that share system prompts), intelligent checkpoint snapshots, and smarter eviction that preserves shared prefixes
- The `ollama launch` command now natively supports Claude Code and OpenClaw as first-class targets, signaling Ollama's pivot from chat-only to agentic infrastructure
- Requires 32+ GB unified memory, which limits this to higher-end Mac configurations
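To make the NVFP4 point above concrete, here is a minimal sketch of FP4 block quantization. Assumptions: real NVFP4 stores 4-bit E2M1 values with an FP8 (E4M3) scale per 16-element block plus a per-tensor FP32 scale; this sketch simplifies the scale to a plain float, keeping only the E2M1 value grid and the block structure.

```python
# Sketch of NVFP4-style block quantization (simplified: the block scale
# is a plain float here, not the FP8 E4M3 scale real NVFP4 uses).

# The 8 magnitudes representable in FP4 E2M1, mirrored for sign.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted(s * v for s in (-1, 1) for v in FP4_E2M1_VALUES)

BLOCK = 16  # NVFP4 scales weights in blocks of 16 elements


def quantize_block(block):
    """Pick a scale so the largest |value| maps onto 6.0 (the FP4 max),
    then snap every element to the nearest representable FP4 value."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    codes = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


# Quantize one block of toy weights and measure the worst-case error.
weights = [0.11 * ((-1) ** i) * (i % 7) for i in range(BLOCK)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the scale is chosen per 16-element block rather than per tensor, one outlier weight only degrades precision inside its own block, which is a large part of why FP4 block formats hold accuracy better than coarser 4-bit schemes.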
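The cross-conversation cache reuse described above can be sketched as prefix matching over checkpointed state. Assumptions: Ollama's real implementation snapshots actual attention KV tensors; here the `PrefixKVCache` class, its method names, and the opaque state strings are all hypothetical stand-ins used only to show the reuse logic.

```python
# Sketch of cross-conversation KV-cache prefix reuse. Token lists stand
# in for real attention KV tensors; the class and names are illustrative.

class PrefixKVCache:
    """Stores KV-state checkpoints keyed by token prefixes, so agents that
    share a long system prompt (e.g. Claude Code) skip re-prefilling it."""

    def __init__(self):
        self._snapshots = {}  # tuple of tokens -> opaque KV state

    def checkpoint(self, tokens):
        # Snapshot the state after prefilling this exact token sequence.
        self._snapshots[tuple(tokens)] = f"kv-state-{len(tokens)}"

    def longest_prefix(self, tokens):
        """Return (matched_len, state) for the longest cached prefix of
        `tokens`, or (0, None) if nothing matches."""
        best_len, best_state = 0, None
        t = tuple(tokens)
        for prefix, state in self._snapshots.items():
            if len(prefix) > best_len and t[: len(prefix)] == prefix:
                best_len, best_state = len(prefix), state
        return best_len, best_state


cache = PrefixKVCache()
system_prompt = list(range(1000))       # shared system-prompt tokens
cache.checkpoint(system_prompt)

request = system_prompt + [2001, 2002]  # new conversation, same prompt
reused, _ = cache.longest_prefix(request)
to_prefill = len(request) - reused      # only the new suffix needs prefill
```

In this sketch a second conversation reusing the same 1,000-token system prompt only prefills its 2 new tokens, which is exactly the pattern that makes shared-prefix eviction policies matter for coding agents.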