Friends Don't Let Friends Use Ollama

Tags: ollama, llama-cpp, local-llm, gguf, self-hosting, open-source, vendor-lock-in
Originally from sleepingrobots.com

My notes

Summary

Ollama built its reputation as an easy wrapper around llama.cpp, but spent years dodging attribution, forking ggml badly, misnaming models (e.g. listing 8B distilled Qwen/Llama variants as “DeepSeek-R1”), shipping a closed-source GUI, and pivoting to VC-backed cloud services. On equal hardware, llama.cpp delivers 1.8x the throughput (161 vs 89 tok/s) and roughly 30-70% more on CPU and code-generation workloads. Multiple mature alternatives (llama.cpp server, LM Studio, Jan, Msty, koboldcpp, ramalama) offer the same “one command” convenience without the middleman.

Key Insights

  • Performance tax is real: llama.cpp runs 1.8x faster on GPU (161 vs 89 tok/s), 30-50% faster on CPU, roughly 70% faster on Qwen-3 Coder 32B. Overhead comes from Ollama’s daemon layer, poor GPU offloading heuristics, and a vendored backend trailing upstream.
  • Vendor lock-in by design: Ollama stores GGUFs under hashed blob filenames, so you can’t point LM Studio or llama.cpp at the same file. The “bring your own GGUF” path is deliberately friction-filled via Modelfiles.
  • Modelfile is anti-GGUF: GGUF’s design goal (single-file, all metadata embedded) is undone by Ollama reintroducing separate config files. Editing one parameter copies the full 30-60 GB model. llama.cpp uses CLI flags.
  • Chat template breakage: Ollama only auto-detects templates from a hardcoded list. Valid Jinja templates in GGUF metadata silently fall back to a bare {{ .Prompt }}, breaking the model’s instruction format; users must translate Jinja to Go template syntax by hand.
  • Quantization ceiling: Ollama can only create Q4_K_S, Q4_K_M, Q8_0, F16, F32. No Q5_K_M, Q6_K, or IQ quants, so quantization has to happen outside Ollama.
  • Model naming fraud: ollama run deepseek-r1 pulls an 8B distilled Qwen, not the real 671B model. GitHub issues #8557 and #8698 requesting a clear separation were closed as duplicates with no fix, doing reputational damage to DeepSeek.
  • Attribution pattern: issue #3185 on MIT license compliance sat ignored for over 400 days; a single-line llama.cpp credit was added to the bottom of the README only after PR pressure. Georgi Gerganov publicly called out Ollama’s bad ggml fork.
  • Cloud pivot contradicts local-first brand: MiniMax and similar proprietary models appear in the model list with no disclosure that prompts are routed off-machine, and third-party provider data handling is undocumented. CVE-2025-51471 (token exfiltration via a malicious registry) took months to patch.
  • The VC playbook: launch on OSS, minimize attribution, create lock-in (hashed blobs, Modelfile), ship closed-source components (GUI app, July 2025), monetize via cloud.
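
The template breakage above comes down to syntax: GGUF metadata embeds a Jinja chat template, while Ollama's TEMPLATE directive expects Go text/template. A hand-translation looks roughly like this (illustrative template text, not any specific model's; the .Messages/.Role/.Content names are Ollama's documented template variables):

```
# Jinja, as embedded in GGUF metadata:
{% for m in messages %}<|{{ m.role }}|>{{ m.content }}{% endfor %}

# Go template, as Ollama's Modelfile TEMPLATE directive expects:
{{- range .Messages }}<|{{ .Role }}|>{{ .Content }}{{- end }}
```

Any model whose template is not on Ollama's hardcoded detection list requires this translation manually, or it silently degrades to the bare prompt.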
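
For the Modelfile and quantization friction above, the llama.cpp equivalents are single commands. A sketch under stated assumptions: the filenames are placeholders, while llama-server's --temp/--ctx-size flags and the llama-quantize invocation are from llama.cpp's shipped tools:

```shell
# Change sampling/context parameters at launch: no Modelfile edit, no 30-60 GB copy
llama-server -m qwen2.5-32b-Q5_K_M.gguf --temp 0.6 --ctx-size 8192

# Produce a Q5_K_M quant, one of the types Ollama cannot generate itself
llama-quantize qwen2.5-32b-F16.gguf qwen2.5-32b-Q5_K_M.gguf Q5_K_M
```

The same server binary exposes an OpenAI-compatible HTTP endpoint, which is the "one command" convenience Ollama is usually credited with.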
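
The hashed-blob lock-in described above can be unwound by hand: each pulled model has a small JSON manifest whose layers point at blob digests, so the underlying GGUF can be located and symlinked into another tool instead of re-downloaded. A minimal sketch; the manifest shape, mediaType string, and ~/.ollama paths are assumptions about Ollama's current on-disk format:

```python
from pathlib import Path

def gguf_blob_path(manifest: dict, blobs_dir: Path) -> Path:
    """Return the on-disk path of the GGUF weights layer in an Ollama manifest."""
    for layer in manifest["layers"]:
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            # Digests like "sha256:abc..." map to blob files named "sha256-abc..."
            return blobs_dir / layer["digest"].replace(":", "-")
    raise ValueError("no model layer in manifest")

# A manifest in the shape Ollama writes under
# ~/.ollama/models/manifests/registry.ollama.ai/library/<model>/<tag>
manifest = {
    "layers": [
        {"mediaType": "application/vnd.ollama.image.model",
         "digest": "sha256:0123abcd"},
        {"mediaType": "application/vnd.ollama.image.template",
         "digest": "sha256:4567ef89"},
    ]
}

path = gguf_blob_path(manifest, Path.home() / ".ollama" / "models" / "blobs")
print(path.name)  # sha256-0123abcd
```

Once found, the blob can be symlinked into LM Studio's model directory rather than copying 30-60 GB of weights.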