How I write software with LLMs

Summary

A practitioner describes a structured multi-agent LLM workflow for producing high-quality software at scale: Opus 4.6 as architect, Sonnet 4.6 as developer, and Codex/Gemini as independent reviewers. The key finding is that using different models for review catches more issues than one model reviewing its own output. The author has maintained multi-week projects with tens of thousands of lines using this approach, with defect rates reportedly lower than hand-written code.

Key Insights

  • Architecture skills matter more now, not less. The skill shift is from “write correct code” to “make correct architectural decisions.” Engineers who understand system design and can steer LLMs away from bad choices still produce dramatically better results than non-engineers using the same tools.
  • Multi-model review beats single-model review. Codex 5.4 is pedantic (good for review), Opus makes decisions that align well with human judgment, Gemini Flash sometimes finds solutions other models miss. Mixing models catches different failure modes.
  • Architect → Developer → Reviewer pipeline rationale:
      • The expensive model (Opus) is used only for planning, so tokens are spent on decisions, not code generation
      • The developer (Sonnet) gets a low-ambiguity plan, minimising bad choices
      • Reviewers are models with no stake in the original code, so they disagree more usefully
  • The “approved” keyword pattern. The author explicitly instructed the architect not to start implementation until the word “approved” appears, which prevents eager models from jumping ahead before the human is satisfied with the plan.
  • Skill files written by hand. Author finds asking LLMs to write their own instruction files ineffective - analogous to telling someone to write a guide on how to be a great engineer and then handing it back to them.
  • Failure mode identified: When the human is unfamiliar with the underlying technology, they can’t catch bad architectural decisions early. The LLM then builds on flawed foundations until the codebase becomes unmaintainable. The failure loop: the LLM says “I know why, let me fix it,” then breaks more things with each attempt.
  • Harness requirements: Must support multiple companies’ models and agents calling each other autonomously. Single-company tools (Claude Code, Codex CLI, Gemini CLI) fail the first requirement.
  • HN counterpoint worth noting: Multiple commenters question whether the productivity gains hold up objectively, and one points out the Stavrobot source code quality was not impressive on inspection. Take the “lower defect rate than hand-written code” claim as anecdotal.
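The Architect → Developer → Reviewer pipeline above can be sketched as a simple orchestration loop. This is a minimal illustration, not the author's actual harness: the model names and the `call` function are hypothetical placeholders for real provider calls. The point it captures is that each stage uses a different model, and reviewers never see the code as their own output.

```python
def run_pipeline(task, call):
    """Sketch of the three-stage workflow; `call(model, prompt)` is a
    stand-in for a real provider API (Anthropic, OpenAI, Google, ...)."""
    # 1. The expensive model spends tokens on decisions, not code.
    plan = call("architect-model", f"Produce a low-ambiguity plan for: {task}")
    # 2. A cheaper model implements the plan, with little room for bad choices.
    code = call("developer-model", f"Implement exactly this plan:\n{plan}")
    # 3. Reviewers from other vendors have no stake in the original code.
    reviews = [
        call(reviewer, f"Review this code critically:\n{code}")
        for reviewer in ("reviewer-model-a", "reviewer-model-b")
    ]
    return {"plan": plan, "code": code, "reviews": reviews}
```

Passing the provider call in as a parameter is what makes the multi-company requirement concrete: any stage can be routed to any vendor's model.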
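The “approved” keyword pattern amounts to a simple gate in the orchestration layer. A minimal sketch, assuming a plain substring check (the function name and keyword handling are illustrative, not from the source):

```python
APPROVAL_KEYWORD = "approved"

def may_implement(human_reply: str) -> bool:
    """Block the implementation phase until the human's reply contains the
    literal keyword. A naive substring check: enthusiasm like "looks good"
    is deliberately not enough to unlock the next stage."""
    return APPROVAL_KEYWORD in human_reply.lower()
```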