LLM Evals: Everything You Need to Know

Curated June 21, 2026 · 2 min read

llm-evals, error-analysis, llm-as-judge, rag

My notes

Summary

An opinionated FAQ distilled from teaching LLM evals to 700+ engineers and PMs. The core thesis: stop buying generic eval metrics and tooling. Do error analysis instead: manually read your own traces, build a failure taxonomy, and only then write targeted evaluators. Evaluation is a human-driven sensemaking process, not a dashboard you switch on.

Key Insight

Error analysis is the whole game. On real projects the authors spent 60-80% of dev time on error analysis and evaluation, not building automated checks. Skipping it produces “counter-productive generic metrics”.
Read ~100 traces, stop at saturation. Heuristic: if ~20 new traces turn up no new failure category, you can stop (but review at least 100 first). Process: open coding (journal a note on the first failure you see in each trace), then axial coding (group the notes into a failure taxonomy and count each category).
A 100% eval pass rate is a red flag - you are not stress-testing. A 70% pass rate often signals a more meaningful eval.
Binary pass/fail beats 1-5 Likert. The gap between adjacent Likert points (3 vs 4) is subjective, Likert scales need bigger samples, and annotators hide uncertainty in the middle of the scale. To track gradual progress, decompose into binary sub-component checks (e.g. “4 of 5 expected facts present”).
Don’t do eval-driven development (writing evals before features) - LLMs have an effectively infinite failure surface that you can’t anticipate. Exception: hard known constraints like “never mention competitors”.
Generic / off-the-shelf metrics waste time and create false confidence. BERTScore, ROUGE, cosine similarity, “helpfulness/coherence/quality” don’t tell you if your system works. Only reuse them as exploration signals to surface interesting traces.
One “benevolent dictator” domain expert beats a committee - eliminates annotation conflicts. Needing 5 SMEs to judge one interaction means your product scope is too broad. Use Cohen’s Kappa only when multiple annotators are unavoidable.
Build a custom annotation tool (Cursor/Lovable, hours of work) - teams with one iterate ~10x faster. Render traces in domain-native form (emails as emails, code with syntax highlighting), ask “Was the appointment booked?” not “Did the tool call succeed?”.
Synthetic data: use structured dimensions, two-step. Define dimensions (e.g. dietary restriction x cuisine x complexity), hand-write 20 tuples, then (1) generate more tuples, (2) convert tuples to natural-language queries in a separate prompt. Avoids generic repetitive output. Unreliable for high-stakes/specialized/low-resource domains.
LLM-as-judge is expensive - needs 100+ labeled examples plus weekly upkeep. Reserve it for persistent failures; use cheap regex/assertions/execution tests first. Validate the judge by its true positive rate (TPR) and true negative rate (TNR) on a held-out human-labeled set; using the same model as the main task is usually fine for scoped binary judging.
Don’t outsource error analysis - it breaks the failure-to-fix feedback loop. Mechanical/context-free tasks (translation, format validation) can be outsourced after the rubric exists.
LLMs can assist, not replace: good for first-pass axial coding (after you’ve coded 30-50 yourself), mapping annotations to known failure modes, analyzing label patterns. Never delegate initial open coding, taxonomy validation, or ground-truth labeling.
“RAG is dead” is marketing. The viral claim only targets naive vector retrieval for coding agents. Evaluate retrieval and generation separately (see Jason Liu’s “6 RAG Evals”: C|Q context relevance, A|C faithfulness, A|Q answer relevance).
Abstention: test “knowing what it doesn’t know” with a balanced set of answerable + unanswerable (false-premise) questions; pass = answers the answerable AND refuses the unanswerable.
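
Sketch for the saturation heuristic in the notes above (review at least 100 traces, then stop once ~20 in a row surface no new failure category). The function name, its parameters, and the one-set-of-category-labels-per-trace input are my own assumptions for illustration, not something the source prescribes.

```python
from typing import Iterable, Optional, Set


def traces_until_saturation(
    trace_categories: Iterable[Set[str]],
    min_reviewed: int = 100,   # "review at least 100 first"
    quiet_window: int = 20,    # "~20 new traces turn up no new failure category"
) -> Optional[int]:
    """Return how many traces were reviewed when the stopping rule first holds.

    trace_categories is a hypothetical input: one set of failure-category
    labels per trace, in review order (an empty set means no failure noted).
    Returns None if the rule never holds within the traces supplied.
    """
    seen: Set[str] = set()
    quiet_streak = 0  # consecutive traces that added no new category
    for reviewed, categories in enumerate(trace_categories, start=1):
        new_categories = set(categories) - seen
        if new_categories:
            seen |= new_categories
            quiet_streak = 0
        else:
            quiet_streak += 1
        if reviewed >= min_reviewed and quiet_streak >= quiet_window:
            return reviewed
    return None
```

A new category showing up late resets the streak, so the stopping point moves out accordingly instead of sticking at 100.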
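
Sketch for the two-step synthetic data note above. Assumptions: the dimension values and the prompt wording are made up for illustration, and exhaustive enumeration with `itertools.product` stands in for the LLM-generated tuples of step 1. No model is called here; step 2 only builds the separate prompt that would be sent.

```python
import itertools
import random
from typing import Dict, List

# Hypothetical dimension values; the source only names the three dimensions.
DIMENSIONS: Dict[str, List[str]] = {
    "dietary_restriction": ["vegan", "gluten-free", "none"],
    "cuisine": ["italian", "thai", "mexican"],
    "complexity": ["quick weeknight", "multi-course"],
}


def sample_tuples(dimensions: Dict[str, List[str]], k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Step 1 stand-in: pick up to k distinct combinations of dimension values."""
    names = list(dimensions)
    combos = [dict(zip(names, values)) for values in itertools.product(*dimensions.values())]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(combos, min(k, len(combos)))


def query_prompt(dimension_tuple: Dict[str, str]) -> str:
    """Step 2: build a separate prompt that turns one tuple into a user query."""
    attributes = "\n".join(f"- {name}: {value}" for name, value in dimension_tuple.items())
    return (
        "Write one realistic user query with these attributes. "
        "Do not list the attributes verbatim.\n" + attributes
    )
```

Keeping tuple selection separate from query wording is the point of the two-step split: coverage is controlled by the tuples, phrasing variety by the second prompt.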
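
Sketch for validating an LLM judge by TPR and TNR against held-out human labels, as in the LLM-as-judge note above. Assumptions: labels are booleans where True means "pass", human labels are the ground truth, and "pass" is treated as the positive class (the source does not say which class is positive). The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def judge_tpr_tnr(
    human_labels: Sequence[bool],
    judge_labels: Sequence[bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (TPR, TNR) of the judge, measured against human labels.

    TPR = share of human-labeled passes the judge also passes.
    TNR = share of human-labeled fails the judge also fails.
    A rate is None when its class has no examples in the held-out set.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("human_labels and judge_labels must have the same length")
    tp = fn = tn = fp = 0
    for human_pass, judge_pass in zip(human_labels, judge_labels):
        if human_pass and judge_pass:
            tp += 1
        elif human_pass and not judge_pass:
            fn += 1
        elif not human_pass and not judge_pass:
            tn += 1
        else:
            fp += 1
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return tpr, tnr
```

Reporting the two rates separately matters when passes and fails are imbalanced: a judge that passes everything gets a TPR of 1.0 but a TNR of 0.0.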
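
Sketch for scoring the abstention test above. Assumptions: each result has already been labeled with whether the question was answerable and whether the system refused; the class and field names are hypothetical. It only checks answer-vs-refuse behavior, not whether the answers given are correct.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AbstentionResult:
    answerable: bool  # False for unanswerable / false-premise questions
    refused: bool     # True if the system declined to answer


def _rate(hits: int, total: int) -> Optional[float]:
    return hits / total if total else None


def abstention_report(results: List[AbstentionResult]) -> Dict[str, Optional[float]]:
    """Score both halves of the balanced set separately, plus overall."""
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    # An item passes when it answers an answerable question
    # or refuses an unanswerable one.
    passed = sum(r.answerable != r.refused for r in results)
    return {
        "answered_answerable": _rate(sum(not r.refused for r in answerable), len(answerable)),
        "refused_unanswerable": _rate(sum(r.refused for r in unanswerable), len(unanswerable)),
        "overall_pass": _rate(passed, len(results)),
    }
```

Keeping the two rates apart shows which way the system errs: refusing everything maxes out `refused_unanswerable` while `answered_answerable` drops to zero.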