Local Qwen isn't a worse Opus, it's a different tool

local-llm, self-hosting, coding-agents, quantization
Originally from blog.alexellis.io

My notes

Summary

A bootstrapped infra founder runs Qwen 3.6 27B locally on a 12 000 USD RTX 6000 Pro and reports honestly that local models are NOT “near-Opus” for writing code. They do pay for themselves on a narrow band of work: privacy-sensitive analysis (customer telemetry, support dumps) where cloud models are contractually off-limits. The real wins are reading and explaining codebases and bounded analysis. Autonomous long-horizon coding is not one of them: there the model loops and burns 600W for half an hour.

Key Insight

The benchmark gap is bigger than it looks

  • Qwen 3.6 27B scores 77.2 on SWE-Bench Verified vs Opus 4.8 at 88.6; the “only 12% behind” framing is misleading.
  • The benchmark is Python (threads/async); real-world Go (channels, contexts, distributed systems) exposes the gap immediately. Benchmaxxing is real.
  • Frontier models are 0.5-2T params; a 27B dense model is “on a different level,” not marginally smaller.
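
One way to see why “only 12% behind” undersells the gap is to look at the tasks each model fails rather than the ones it solves. The sketch below is plain arithmetic on the two scores quoted above; the function name and the failure-ratio reading are illustrative and not taken from the original post.

```python
def benchmark_gap(local_score: float, frontier_score: float) -> dict:
    """Express a benchmark gap three ways; scores are percentages of tasks resolved."""
    local_fail = 100.0 - local_score        # share of tasks the local model fails
    frontier_fail = 100.0 - frontier_score  # share of tasks the frontier model fails
    return {
        # Raw difference in percentage points.
        "absolute_points": round(frontier_score - local_score, 1),
        # How far behind the local model is, relative to the frontier score.
        "relative_pct": round((frontier_score - local_score) / frontier_score * 100, 1),
        # How many unresolved tasks the local model leaves per frontier failure.
        "failure_ratio": round(local_fail / frontier_fail, 2),
    }


gap = benchmark_gap(77.2, 88.6)
print(gap)
```

On these numbers the relative gap is roughly 13%, but the local model leaves about twice as many tasks unresolved, and that is the figure you feel when supervising it.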

Where local actually earned its keep

  • Renewal audit: feeding a telemetry DB into the local model found a customer under-paying for licenses by 4-5x for 12 months. That recovery alone paid for the card.
  • “diag” CLI dumps from enterprise customers are run through an airgapped local model in an ephemeral Slicer VM, so no customer data leaves the premises.
  • Even the 30-day retention on ChatGPT Pro / Claude Max “likely invalidates your contracts with customers.” Near/far-east coding plans take privileged IP positions: caveat emptor.

The failure modes (why it can’t be trusted unsupervised)

  • Two loop types: (1) repeating output forever (the worse one; it happened in the middle of customer-support work), (2) corrupting a file then refusing to give up, going “progressively off the rails.”
  • Hallucinates filenames/tool calls when the context fills (~/faas-netes became ~/faaned).
  • Arithmetic failures: 27.3K read as 273 000. Conflates “few functions” with “low usage”, ignoring call frequency. Better at analysis than interpretation.
  • Looping got worse after re-enabling thinking mode.
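
The first loop type (the same output repeated forever) is the kind a supervising harness can catch mechanically. The following is a minimal sketch, assuming you have the streamed output so far as a string; `repeated_tail_period` and its thresholds are hypothetical and not something the original post describes.

```python
from typing import Optional


def repeated_tail_period(
    text: str,
    min_period: int = 8,
    max_period: int = 200,
    min_repeats: int = 3,
) -> Optional[int]:
    """Return the length of a chunk repeated back-to-back at the end of `text`.

    Tries every candidate period from shortest to longest and reports the first
    one whose chunk appears `min_repeats` times in a row at the tail.
    Returns None when no such repetition is found.
    """
    for period in range(min_period, max_period + 1):
        if len(text) < period * min_repeats:
            break  # longer periods cannot fit either
        chunk = text[-period:]
        if text.endswith(chunk * min_repeats):
            return period
    return None


sample = "intro " + "I will retry the edit. " * 5
period = repeated_tail_period(sample)
if period is not None:
    print(f"looping: last {period} characters keep repeating, stop generation")
```

A check like this only covers verbatim repetition. The second loop type, where the model keeps trying new ways to fix a file it corrupted, still needs a human or a step budget.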

The concrete setup (verbatim, useful)

  • llama.cpp built from source weekly; full 262144 context, f16 KV cache, --parallel 1 to keep the full context (--parallel 2 halves it).
  • Speculative decoding via MTP: ~93% acceptance, throughput rises from 67 tok/s to 130-200 tok/s sustained, “feels faster than cloud.”
  • KV-cache quant rule of thumb: bad things happen with Q4_0 on keys; the most aggressive safe setting is Q8_0 keys / Q4_0 values.
  • vLLM was 3 tok/s slower than llama.cpp for single-user prosumer use and had multi-minute load times: right for production batching, wrong here.
  • Follow the model card’s temperature: base Qwen wants 0.6; the Qwopus fine-tune wants thinking OFF and temp 0.85-1.0.
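
To get a feel for what the cache-type and --parallel choices above cost in memory, here is a rough estimator. The bytes-per-element figures come from the ggml block layouts (Q8_0 stores 32 values in 34 bytes, Q4_0 stores 32 values in 18 bytes). The layer count, KV-head count and head dimension are placeholders, NOT the real Qwen 3.6 27B architecture, and the slot split simply restates the note that --parallel 2 halves the context.

```python
# Bytes per cached element for each cache type (fp16 block scale included).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 k_type: str = "f16", v_type: str = "f16") -> float:
    """Estimate KV-cache size in GiB for a plain attention stack."""
    # Each token stores one key and one value vector per layer and KV head.
    per_token = n_layers * n_kv_heads * head_dim * (
        BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type]
    )
    return ctx * per_token / 2**30


def ctx_per_slot(ctx: int, parallel: int) -> int:
    """Context available to each request when the total is split across slots."""
    return ctx // parallel


# Placeholder architecture values, chosen only to illustrate the arithmetic.
ARCH = dict(n_layers=64, n_kv_heads=8, head_dim=128)

full_f16 = kv_cache_gib(262144, **ARCH)
k8_v4 = kv_cache_gib(262144, **ARCH, k_type="q8_0", v_type="q4_0")
print(f"f16/f16: {full_f16:.1f} GiB, q8_0/q4_0: {k8_v4:.1f} GiB")
print(f"context per slot with --parallel 2: {ctx_per_slot(262144, 2)}")
```

Whatever the real architecture numbers are, the ratio holds: Q8_0 keys with Q4_0 values take about 41% of the f16 cache, which is why the rule of thumb matters at a 262144-token context.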

The hidden cost = operations, not tokens

  • Comparing $/M tokens to GPT API pricing is the wrong comparison. Once a second person uses it, you inherit identity, access control, metering, quotas, model routing, and power monitoring (Shelly plugs at the wall: RTX 6000 = 600W, dual 3090s = 750W and loud).
  • Skip 70B models (old, generations behind) and 35-A3B (only 3B active, fast on a MacBook but lower quality). Bigger frontier OSS models (GLM 5.2, Kimi 2.7, Deepseek V4) need 4-6 RTX 6000 cards.
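
Wall-plug readings turn into a running cost with one multiplication. This is a small sketch using the wattages quoted above; the 0.30 per kWh electricity price is an assumed placeholder, so substitute your own tariff.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy drawn at a constant wattage over a duration, in kWh."""
    return watts * hours / 1000.0


def run_cost(watts: float, hours: float, price_per_kwh: float) -> float:
    """Electricity cost of a run at the given price per kWh."""
    return energy_kwh(watts, hours) * price_per_kwh


PRICE_PER_KWH = 0.30  # assumed tariff, not from the original post

# A half-hour runaway loop on each setup mentioned in the notes.
for name, watts in [("RTX 6000", 600), ("dual 3090s", 750)]:
    kwh = energy_kwh(watts, 0.5)
    cost = run_cost(watts, 0.5, PRICE_PER_KWH)
    print(f"{name}: {kwh:.3f} kWh, cost {cost:.2f}")
```

A single looped half hour is cheap in electricity. The point of metering at the wall is to notice the pattern, because the larger cost is the supervision and operations work around it.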