Local Qwen isn't a worse Opus, it's a different tool

Curated June 17, 2026 · 2 min read

local-llm, self-hosting, coding-agents, quantization

My notes

Summary

A bootstrapped infra founder runs local Qwen 3.6 27B on a 12,000 USD RTX 6000 Pro and reports honestly that local models are NOT “near-Opus” for writing code, but that they do pay for themselves on a narrow band of work: privacy-sensitive analysis (customer telemetry, support dumps) where cloud models are contractually off-limits. The real wins are reading and explaining codebases and bounded analysis, not autonomous long-horizon coding, where the model loops and burns 600W for half an hour.

Key Insight

The benchmark gap is bigger than it looks

Qwen 3.6 27B scores 77.2 on SWE-Bench Verified vs Opus 4.8 at 88.6; the “only 12% behind” framing is misleading.
The benchmark is Python (threads/async); real-world Go (channels, contexts, distributed systems) exposes the gap immediately. Benchmaxxing is real.
Frontier models are 0.5-2T params; next to them a 27B dense model is “on a different level,” not just marginally smaller.
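
One way to see why the relative framing understates the gap is to compare failure rates instead of pass rates. The sketch below uses only the two scores quoted above; the function and key names are illustrative and not from the original post.

```python
# Scores quoted above (SWE-Bench Verified, percent of tasks resolved).
QWEN_SCORE = 77.2
OPUS_SCORE = 88.6


def gap_summary(local: float, frontier: float) -> dict:
    """Express the same two scores as three different 'gaps'."""
    return {
        # Absolute difference in percentage points.
        "point_gap": frontier - local,
        # Difference relative to the frontier score (the "12% behind" view).
        "relative_gap_pct": (frontier - local) / frontier * 100,
        # How many times more tasks the local model leaves unresolved.
        "failure_ratio": (100 - local) / (100 - frontier),
    }


summary = gap_summary(QWEN_SCORE, OPUS_SCORE)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On these numbers the local model leaves 22.8% of tasks unresolved against 11.4% for the frontier model, so it fails about twice as often even though the pass rates look close.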

Where local actually earned its keep

Renewal audit: feeding a telemetry DB into the local model found a customer underpaying for licenses by 4-5x for 12 months. That recovery alone paid for the card.
“diag” CLI dumps from enterprise customers run through an airgapped local model in an ephemeral Slicer VM, so no customer data leaves the premises.
Even ChatGPT Pro / Claude Max 30-day retention “likely invalidates your contracts with customers.” Near/Far East coding plans take privileged positions on your IP; caveat emptor.

The failure modes (why it can’t be trusted unsupervised)

Two loop types: (1) repeating output forever (the worse one; it happened in the middle of customer-support work), (2) corrupting a file, then refusing to give up and going “progressively off the rails.”
Hallucinates filenames and tool calls when the context fills (~/faas-netes became ~/faaned).
Arithmetic failures: 27.3K (27,300) read as 273,000, a 10x error. Conflates “few functions” with “low usage,” ignoring call frequency. Better at analysis than interpretation.
Looping got worse after re-enabling thinking mode.
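
The first loop type (the same output repeated forever) is the kind of failure a supervising script can catch mechanically. The following is a minimal sketch of such a watchdog, not something the original post describes: `find_loop_period` and its thresholds are hypothetical, and it only detects exact back-to-back repetition at the end of the generated text.

```python
from typing import Optional


def find_loop_period(
    text: str,
    min_period: int = 8,
    max_period: int = 400,
    repeats: int = 3,
) -> Optional[int]:
    """Return the length of a chunk repeated `repeats` times at the end
    of `text`, or None if the tail does not look like a loop."""
    for period in range(min_period, max_period + 1):
        # Longer periods need even more text, so stop once it runs out.
        if len(text) < period * repeats:
            break
        unit = text[-period:]
        if text.endswith(unit * repeats):
            return period
    return None


looping = "Patching the file. " + "I will retry the edit. " * 5
normal = "The audit found one customer on an old plan and nothing else unusual."

print(find_loop_period(looping))
print(find_loop_period(normal))
```

A caller would run this check on the accumulated output every few hundred tokens and abort the request when it returns a period. It does nothing for the second loop type, where the model keeps producing different but increasingly wrong edits.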

The concrete setup (verbatim, useful)

llama.cpp built from source weekly; full 262144 context, f16 KV cache, --parallel 1 to keep full context (parallel 2 halves it).
Speculative decoding via MTP: ~93% acceptance, taking throughput from 67 tok/s to 130-200 tok/s sustained; “feels faster than cloud.”
KV-cache quant rule of thumb: quality degrades with Q4_0 keys; the most aggressive safe setting is Q8_0 keys / Q4_0 values.
vLLM was 3 tok/s slower than llama.cpp for single-user prosumer use and had multi-minute load times: right for production batching, wrong here.
Follow the model card’s temperature: base Qwen wants 0.6; the Qwopus fine-tune wants thinking OFF and temp 0.85-1.0.
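
The context and KV-cache settings above can be sketched as a small launcher helper. This is an illustration under stated assumptions, not the author's actual command line: the model path is a placeholder, the speculative-decoding options are left out because the notes do not list them, and the even split of context across slots follows the notes' remark that parallel 2 halves it. The bytes-per-element figures are the GGML block sizes for f16, Q8_0 and Q4_0.

```python
import shlex

# GGML storage cost per cached element, in bytes:
# f16 is 2 bytes; q8_0 packs 32 values into 34 bytes; q4_0 packs 32 into 18.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Context available to each slot when the total is split evenly."""
    return ctx_size // parallel


def kv_cache_ratio(key_type: str, value_type: str) -> float:
    """KV-cache size relative to an all-f16 cache (1.0 means no saving)."""
    used = BYTES_PER_ELEMENT[key_type] + BYTES_PER_ELEMENT[value_type]
    return used / (2 * BYTES_PER_ELEMENT["f16"])


def server_args(model_path: str, ctx_size: int = 262144,
                parallel: int = 1, cache_type: str = "f16") -> list:
    """Build a llama-server argument list for the settings in the notes."""
    return [
        "llama-server",
        "--model", model_path,          # placeholder path, an assumption
        "--ctx-size", str(ctx_size),
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "--parallel", str(parallel),
    ]


print(shlex.join(server_args("models/qwen-27b.gguf")))
print(per_slot_context(262144, 2))
print(kv_cache_ratio("q8_0", "q4_0"))
```

By this estimate the Q8_0 keys / Q4_0 values combination shrinks the cache to roughly 41% of its f16 size, which is the trade being weighed against the quality loss described above.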

The hidden cost = operations, not tokens

Comparing $/M tokens to GPT API pricing is the wrong comparison. Once a second person uses it, you inherit identity, access control, metering, quotas, model routing, and power monitoring (Shelly plugs at the wall: RTX 6000 = 600W, dual 3090s = 750W and loud).
Skip 70B models (old, generations behind) and 35-A3B (only 3B active; fast on a MacBook but lower quality). Bigger frontier OSS models (GLM 5.2, Kimi 2.7, DeepSeek V4) need 4-6 RTX 6000 cards.
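
The hardware side of that cost can be put into rough numbers using only figures from these notes: 12,000 USD and 600W per RTX 6000 card. The sketch below is a back-of-envelope estimate, not the author's accounting. It assumes every card costs and draws the same as the single card described, and it ignores the host machine, cooling and electricity prices.

```python
CARD_PRICE_USD = 12_000  # price of one RTX 6000 Pro, from the summary
CARD_WATTS = 600         # measured draw at the wall, from the notes


def energy_kwh(watts: float, hours: float) -> float:
    """Energy used by a load of `watts` running for `hours`."""
    return watts * hours / 1000


def fleet_estimate(cards: int) -> dict:
    """Card-only purchase cost and peak power for a multi-card box."""
    return {
        "capex_usd": cards * CARD_PRICE_USD,
        "peak_kw": cards * CARD_WATTS / 1000,
    }


# One runaway loop: 600W for half an hour.
print(energy_kwh(CARD_WATTS, 0.5))

# The 4-6 cards the larger open models are said to need.
for cards in (4, 6):
    print(cards, fleet_estimate(cards))
```

That puts the larger open models at 48,000-72,000 USD in cards alone and 2.4-3.6 kW at peak, before any of the operational work listed above.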