Local Qwen isn't a worse Opus, it's a different tool
Originally from blog.alexellis.io
My notes
Summary
A bootstrapped infra founder runs local Qwen 3.6 27B on a 12 000 USD RTX 6000 Pro and reports honestly that local models are NOT “near-Opus” for writing code, but they pay for themselves on a narrow band of work: privacy-sensitive analysis (customer telemetry, support dumps) where cloud models are contractually off-limits. The real wins are reading/explaining codebases and bounded analysis, not autonomous long-horizon coding, where the model loops and burns 600W for half an hour.
Key Insight
The benchmark gap is bigger than it looks
- Qwen 3.6 27B scores 77.2 on SWE-Bench Verified vs Opus 4.8 at 88.6; the “only 12% behind” framing is misleading.
- Benchmark is Python (threads/async); real-world Go (channels, contexts, distributed) exposes the gap immediately. Benchmaxxing is real.
- Frontier models are 0.5-2T params; a 27B dense model is “on a different level,” not marginally smaller.
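
One way to see why the "only 12% behind" framing undersells the gap is to restate the same two scores as failure rates. The sketch below is an illustration added to these notes, not a calculation from the original post; the function name `gap_summary` is hypothetical and only the two scores come from the notes above.

```python
# Scores quoted in the notes above (SWE-Bench Verified, percent resolved).
QWEN_SCORE = 77.2
OPUS_SCORE = 88.6


def gap_summary(small: float, large: float) -> dict:
    """Express one benchmark gap three different ways."""
    return {
        # Absolute difference in percentage points.
        "points_behind": round(large - small, 1),
        # Difference relative to the larger score (the "12% behind" style).
        "relative_pct_behind": round((large - small) / large * 100, 1),
        # How many times more tasks the smaller model fails.
        "failure_ratio": round((100 - small) / (100 - large), 2),
    }


summary = gap_summary(QWEN_SCORE, OPUS_SCORE)
print(summary)
```

Read as failures instead of successes, the smaller model leaves about twice as many benchmark tasks unresolved, which is a different picture from a roughly 12% shortfall.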
Where local actually earned its keep
- Renewal audit: feeding a telemetry DB into the local model found a customer under-paying licenses by 4-5x for 12 months. That recovery alone paid for the card.
- “diag” CLI dumps from enterprise customers are run through an airgapped local model in an ephemeral Slicer VM, so no customer data leaves the premises.
- Even ChatGPT Pro / Claude Max 30-day retention “likely invalidates your contracts with customers.” Near/far-east coding plans take privileged IP positions; caveat emptor.
The failure modes (why it can’t be trusted unsupervised)
- Two loop types: (1) repeating output forever (worse, happened mid customer-support work), (2) corrupting a file then refusing to give up, going “progressively off the rails.”
- Hallucinates filenames/tool calls when context fills (~/faas-netes became ~/faaned).
- Arithmetic failures: 27.3K read as 273 000. Conflates “few functions” with “low usage”, ignoring call frequency. Better at analysis than interpretation.
- Looping got worse after re-enabling thinking mode.
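
The first loop type (repeating output forever) is the kind of failure a supervising script can catch mechanically. The following is a minimal sketch of one possible guard, assuming you have the generated text available as a string; it is not something the original post describes, and the function name `looks_like_loop` and both thresholds are hypothetical and would need tuning.

```python
from collections import Counter


def looks_like_loop(text: str, ngram: int = 8, max_repeats: int = 4) -> bool:
    """Return True if any run of `ngram` consecutive words occurs
    more than `max_repeats` times in `text`."""
    words = text.split()
    if len(words) < ngram:
        return False
    # Count every sliding window of `ngram` words.
    counts = Counter(
        tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)
    )
    return max(counts.values()) > max_repeats


normal = "the quick brown fox jumps over the lazy dog and then goes home to sleep"
stuck = "I will now retry the edit to the file. " * 10

print(looks_like_loop(normal), looks_like_loop(stuck))
```

A check like this only covers verbatim repetition. The second loop type, where the model keeps making new but increasingly wrong edits, would not trip it and still needs a human or a step budget.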
The concrete setup (verbatim, useful)
- llama.cpp built from source weekly; full 262144 context, f16 KV cache, --parallel 1 to keep full context (parallel 2 halves it).
- Speculative decoding via MTP: ~93% acceptance, 67 tok/s to 130-200 tok/s sustained, “feels faster than cloud.”
- KV-cache quant rule of thumb: bad things at Q4_0 on keys; most aggressive safe = Q8_0 keys / Q4_0 values.
- vLLM was 3 tok/s slower than llama.cpp for single-user prosumer use and had multi-minute load times; it is right for production batching, wrong here.
- Follow the model card’s temperature: base Qwen 0.6, Qwopus fine-tune wants thinking OFF and temp 0.85-1.0.
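
The numbers in this list can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not the author's actual configuration: it assumes keys and values hold the same number of elements, it uses the standard llama.cpp block layouts for Q8_0 and Q4_0 (32 values plus one 16-bit scale), and the model path is a placeholder. The flags for MTP speculative decoding are left out because the notes do not record them.

```python
# Bits per cached element. Q8_0 and Q4_0 store blocks of 32 values plus one
# 16-bit scale, which adds 0.5 bit per element on top of the 8 or 4 data bits.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def context_per_slot(ctx_size: int, parallel: int) -> int:
    """Context available to each request when the total is split across slots."""
    return ctx_size // parallel


def kv_size_vs_f16(key_type: str, value_type: str) -> float:
    """KV cache size relative to an all-f16 cache (assumes equal K and V sizes)."""
    used = BITS_PER_ELEMENT[key_type] + BITS_PER_ELEMENT[value_type]
    return used / (2 * BITS_PER_ELEMENT["f16"])


def llama_server_args(model_path: str) -> list[str]:
    """Hypothetical llama-server invocation matching the settings in the notes."""
    return [
        "llama-server",
        "--model", model_path,     # placeholder path, not from the notes
        "--ctx-size", "262144",    # full context
        "--parallel", "1",         # one slot keeps the whole context
        "--cache-type-k", "f16",
        "--cache-type-v", "f16",
        "--temp", "0.6",           # base Qwen temperature from the notes
    ]


print(context_per_slot(262144, 2))        # context left per slot with 2 slots
print(kv_size_vs_f16("q8_0", "q4_0"))     # most aggressive "safe" quantisation
print([round(t / 67, 2) for t in (130, 200)])  # speed-up range from MTP
```

Under these assumptions, Q8_0 keys with Q4_0 values shrink the cache to roughly 40% of its f16 size, and the reported throughput change works out to about a 2-3x speed-up.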
The hidden cost = operations, not tokens
- Comparing $/M tokens to GPT API pricing is the wrong comparison. Once a 2nd person uses it you inherit identity, access control, metering, quotas, model routing, power monitoring (Shelly plugs at the wall: RTX 6000 = 600W, dual 3090s = 750W and loud).
- Skip 70B (old/generations behind) and 35-A3B (only 3B active, fast on MacBook but lower quality). Bigger frontier OSS (GLM 5.2, Kimi 2.7, Deepseek V4) need 4-6 RTX 6000 cards.
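
The wall-plug readings translate into energy with one multiplication. This is a small sketch of that arithmetic using the wattages from the notes; the electricity price is a hypothetical placeholder, not a figure from the original post, so substitute your own tariff.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant draw, in kilowatt-hours."""
    return watts * hours / 1000


# Hypothetical tariff in currency units per kWh; replace with your own.
PRICE_PER_KWH = 0.30

# A half-hour runaway loop on each setup mentioned in the notes.
rtx_6000_loop = energy_kwh(600, 0.5)
dual_3090_loop = energy_kwh(750, 0.5)

print(f"RTX 6000 loop: {rtx_6000_loop} kWh, cost {rtx_6000_loop * PRICE_PER_KWH:.2f}")
print(f"Dual 3090 loop: {dual_3090_loop} kWh, cost {dual_3090_loop * PRICE_PER_KWH:.2f}")
```

The per-incident energy figure is small, which is consistent with the point above: the expensive part of running local models is the operational work around them, not the electricity or the tokens.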