Friends Don't Let Friends Use Ollama

Tags: ollama, llama-cpp, local-llm, gguf, self-hosting, open-source, vendor-lock-in
Originally from sleepingrobots.com

My notes

Summary

Ollama built its reputation as an easy wrapper around llama.cpp, but spent years dodging attribution, forking ggml badly, misnaming models (e.g. listing 8B distilled Qwen/Llama variants as “DeepSeek-R1”), shipping a closed-source GUI, and pivoting to VC-backed cloud services. On equal hardware, llama.cpp delivers 1.8x the throughput (161 vs 89 tok/s) and roughly 30-70% more on CPU and code-generation workloads. Multiple mature alternatives (llama.cpp server, LM Studio, Jan, Msty, koboldcpp, ramalama) offer the same “one command” convenience without the middleman.

Key Insights

  • Performance tax is real: llama.cpp runs 1.8x faster on GPU (161 vs 89 tok/s), 30-50% faster on CPU, roughly 70% faster on Qwen-3 Coder 32B. Overhead comes from Ollama’s daemon layer, poor GPU offloading heuristics, and a vendored backend trailing upstream.
  • Vendor lock-in by design: Ollama stores GGUFs under hashed blob filenames, so you can’t point LM Studio or llama.cpp at the same file. The “bring your own GGUF” path is deliberately friction-filled via Modelfiles.
  • Modelfile is anti-GGUF: GGUF’s design goal (single-file, all metadata embedded) is undone by Ollama reintroducing separate config files. Editing one parameter copies the full 30-60 GB model. llama.cpp uses CLI flags.
  • Chat template breakage: Ollama only auto-detects templates from a hardcoded list. Valid Jinja templates in GGUF metadata silently fall back to a bare {{ .Prompt }}, breaking the model’s instruction format; users must translate Jinja to Go template syntax by hand.
  • Quantization ceiling: Ollama can only create Q4_K_S, Q4_K_M, Q8_0, F16, F32. No Q5_K_M, Q6_K, or IQ quants, so quantization has to happen outside Ollama.
  • Model naming fraud: ollama run deepseek-r1 pulls an 8B distilled Qwen, not the real 671B model. GitHub issues #8557 and #8698 requesting a clear separation were closed as duplicates with no fix, doing reputational damage to DeepSeek.
  • Attribution pattern: issue #3185 on MIT license compliance sat ignored for over 400 days; a single-line llama.cpp credit was added to the bottom of the README only after PR pressure. Georgi Gerganov publicly called out Ollama’s bad ggml fork.
  • Cloud pivot contradicts local-first brand: MiniMax and similar proprietary models appear in the model list with no disclosure that prompts are routed off-machine, and third-party provider data handling is undocumented. CVE-2025-51471 (token exfiltration via a malicious registry) took months to patch.
  • The VC playbook: launch on OSS, minimize attribution, create lock-in (hashed blobs, Modelfile), ship closed-source components (GUI app, July 2025), monetize via cloud.
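
The template breakage above comes down to syntax: GGUF metadata embeds a Jinja chat template, while Ollama's TEMPLATE directive expects Go text/template. A hand-translation looks roughly like this (illustrative template text, not any specific model's; the .Messages/.Role/.Content names are Ollama's documented template variables):

```
# Jinja, as embedded in GGUF metadata:
{% for m in messages %}<|{{ m.role }}|>{{ m.content }}{% endfor %}

# Go template, as Ollama's Modelfile TEMPLATE directive expects:
{{- range .Messages }}<|{{ .Role }}|>{{ .Content }}{{- end }}
```

Any model whose template is not on Ollama's hardcoded detection list requires this translation manually, or it silently degrades to the bare prompt.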
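
For the Modelfile and quantization friction above, the llama.cpp equivalents are single commands. A sketch under stated assumptions: the filenames are placeholders, while llama-server's --temp/--ctx-size flags and the llama-quantize invocation are from llama.cpp's shipped tools:

```shell
# Change sampling/context parameters at launch: no Modelfile edit, no 30-60 GB copy
llama-server -m qwen2.5-32b-Q5_K_M.gguf --temp 0.6 --ctx-size 8192

# Produce a Q5_K_M quant, one of the types Ollama cannot generate itself
llama-quantize qwen2.5-32b-F16.gguf qwen2.5-32b-Q5_K_M.gguf Q5_K_M
```

The same server binary exposes an OpenAI-compatible HTTP endpoint, which is the "one command" convenience Ollama is usually credited with.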
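
The hashed-blob lock-in described above can be unwound by hand: each pulled model has a small JSON manifest whose layers point at blob digests, so the underlying GGUF can be located and symlinked into another tool instead of re-downloaded. A minimal sketch; the manifest shape, mediaType string, and ~/.ollama paths are assumptions about Ollama's current on-disk format:

```python
from pathlib import Path

def gguf_blob_path(manifest: dict, blobs_dir: Path) -> Path:
    """Return the on-disk path of the GGUF weights layer in an Ollama manifest."""
    for layer in manifest["layers"]:
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            # Digests like "sha256:abc..." map to blob files named "sha256-abc..."
            return blobs_dir / layer["digest"].replace(":", "-")
    raise ValueError("no model layer in manifest")

# A manifest in the shape Ollama writes under
# ~/.ollama/models/manifests/registry.ollama.ai/library/<model>/<tag>
manifest = {
    "layers": [
        {"mediaType": "application/vnd.ollama.image.model",
         "digest": "sha256:0123abcd"},
        {"mediaType": "application/vnd.ollama.image.template",
         "digest": "sha256:4567ef89"},
    ]
}

path = gguf_blob_path(manifest, Path.home() / ".ollama" / "models" / "blobs")
print(path.name)  # sha256-0123abcd
```

Once found, the blob can be symlinked into LM Studio's model directory rather than copying 30-60 GB of weights.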