# Friends Don't Let Friends Use Ollama

> Ollama wraps llama.cpp but skipped attribution, forked ggml badly, and pivoted to VC-backed cloud. llama.cpp delivers up to 1.8x throughput on the same hardware.

Published: 2026-04-16
URL: https://daniliants.com/insights/friends-dont-let-friends-use-ollama/
Tags: ollama, llama-cpp, local-llm, gguf, self-hosting, open-source, vendor-lock-in

---

## Summary

Ollama built its reputation as an easy wrapper around llama.cpp but spent years dodging attribution, forking ggml badly, misnaming models (e.g. listing 8B distilled Qwen/Llama variants as "DeepSeek-R1"), shipping a closed-source GUI, and pivoting to VC-backed cloud services. On identical hardware, llama.cpp delivers up to 1.8x the throughput (161 vs. 89 tok/s) and roughly 30-70% more on CPU and code-generation workloads. Multiple mature alternatives (llama.cpp server, LM Studio, Jan, Msty, koboldcpp, ramalama) offer the same "one command" convenience without the middleman.
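The "one command" convenience the alternatives offer looks roughly like this; a minimal sketch assuming a recent llama.cpp build, where `llama-server` can load a local GGUF or fetch one from Hugging Face via `-hf` (both the filename and the repo name below are illustrative):

```shell
# start an OpenAI-compatible local server from a single GGUF file
llama-server -m ./qwen2.5-7b-instruct-q4_k_m.gguf --port 8080

# recent builds can also fetch the GGUF for you
llama-server -hf ggml-org/gemma-3-1b-it-GGUF --port 8080
```

Either way there is no daemon, no registry, and no separate model store: the GGUF file on disk is the model.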

## Key Insights

- **Performance tax is real**: llama.cpp runs 1.8x faster on GPU (161 vs 89 tok/s), 30-50% faster on CPU, and roughly 70% faster on Qwen-3 Coder 32B. The overhead comes from Ollama's daemon layer, poor GPU-offloading heuristics, and a vendored ggml backend that trails upstream.
- **Vendor lock-in by design**: Ollama stores GGUFs under hashed blob filenames, so you can't point LM Studio or llama.cpp at the same file. The "bring your own GGUF" path is deliberately friction-filled via Modelfiles.
- **Modelfile is anti-GGUF**: GGUF's design goal (single-file, all metadata embedded) is undone by Ollama reintroducing separate config files. Editing one parameter copies the full 30-60 GB model. llama.cpp uses CLI flags.
- **Chat template breakage**: Ollama only auto-detects templates from a hardcoded list. Valid Jinja templates in GGUF metadata silently fall back to bare `{{ .Prompt }}`, breaking the model's instruction format. Users must manually translate Jinja templates into Go template syntax.
- **Quantization ceiling**: Ollama can only create Q4_K_S, Q4_K_M, Q8_0, F16, F32. No Q5_K_M, Q6_K, or IQ quants, so quantization has to happen outside Ollama.
- **Model naming fraud**: `ollama run deepseek-r1` pulls an 8B distilled Qwen model, not the real 671B model. GitHub issues #8557 and #8698 requesting a clear separation were closed as duplicates with no fix, causing reputational damage to DeepSeek.
- **Attribution pattern**: issue #3185 on MIT license compliance sat ignored for over 400 days. A single-line llama.cpp credit was added to the bottom of the README only after PR pressure. Georgi Gerganov publicly called out Ollama's bad ggml fork.
- **Cloud pivot contradicts the local-first brand**: MiniMax and similar proprietary models appear in the model list with no disclosure that prompts are routed off-machine, and data handling by the third-party providers is undocumented. CVE-2025-51471 (token exfiltration via a malicious registry) took months to patch.
- **The VC playbook**: launch on OSS, minimize attribution, create lock-in (hashed blobs, Modelfile), ship closed-source components (GUI app July 2025), monetize via cloud.
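The hashed-blob lock-in described above is content addressing: each layer lands in `~/.ollama/models/blobs/` under a `sha256-<digest>` filename, so the GGUF is intact but unrecognizable by name. A minimal sketch of how those names are derived (the path is from a default install; the file and its content here are stand-ins for the demo):

```shell
# A blob's filename is the SHA-256 digest of its contents, prefixed "sha256-".
printf 'GGUF' > fake-model.gguf              # stand-in file for the demo
digest=$(sha256sum fake-model.gguf | cut -d' ' -f1)
blob="sha256-${digest}"
echo "Ollama would store this file as: ${blob}"

# The real blobs under ~/.ollama/models/blobs/ are still valid GGUFs, so
# llama.cpp can load one directly once you match the digest to a manifest:
#   llama-server -m ~/.ollama/models/blobs/sha256-<digest>
rm fake-model.gguf
```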
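The Modelfile round-trip versus plain flags, side by side; a sketch using standard Modelfile directives, with the model filename and parameter values as illustrative placeholders:

```shell
# Ollama: changing a parameter means writing a Modelfile and re-creating the model
cat > Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF
ollama create my-model -f Modelfile   # imports the full weights into Ollama's blob store

# llama.cpp: the same settings are runtime flags against the original file
llama-server -m ./my-model.gguf --temp 0.7 -c 8192
```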
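The template mismatch in practice: a ChatML-style Jinja template embedded in GGUF metadata has no effect in Ollama until someone hand-writes the equivalent in Go template syntax. A sketch (both fragments are illustrative, following common ChatML conventions):

```shell
# Jinja, as embedded in GGUF metadata and used directly by llama.cpp:
#   {% for message in messages %}<|im_start|>{{ message.role }}
#   {{ message.content }}<|im_end|>
#   {% endfor %}

# Hand-translated Go template for an Ollama Modelfile TEMPLATE directive:
cat <<'EOF'
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
EOF
```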
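The quantization ceiling means types like Q6_K have to be produced with llama.cpp's own tooling anyway; a sketch of the standard `llama-quantize` invocation (filenames are placeholders):

```shell
# convert an F16 GGUF to Q6_K, a type Ollama cannot produce itself
llama-quantize ./my-model-f16.gguf ./my-model-Q6_K.gguf Q6_K
```

The resulting file can then be imported back into Ollama via a Modelfile, or simply served with `llama-server -m` and no import step at all.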