LLM Evals: Everything You Need to Know
Originally from hamel.dev
My notes
Summary
A sharp-opinion FAQ distilled from teaching LLM evals to 700+ engineers and PMs. The core thesis: stop buying generic eval metrics and tooling, and instead do error analysis - manually read your own traces, build a failure taxonomy, and only then write targeted evaluators. Evaluation is a human-driven sensemaking process, not a dashboard you switch on.
Key Insight
- Error analysis is the whole game. On real projects the authors spent 60-80% of dev time on error analysis and evaluation rather than on building automated checks. Skipping it produces “counter-productive generic metrics”.
- Read ~100 traces, stop at saturation. Heuristic: if ~20 new traces turn up no new failure category, you can stop (but review at least 100 first). The process is open coding (journal a note on the first failure you see in each trace), then axial coding (group the notes into a failure taxonomy and count each category).
- A 100% eval pass rate is a red flag - you are not stress-testing. A 70% pass rate often signals a more meaningful eval.
- Binary pass/fail beats 1-5 Likert. The difference between adjacent Likert points (3 vs 4) is subjective, Likert scales need bigger samples, and annotators hide uncertainty in the middle of the scale. Decompose into sub-component binary checks (e.g. “4 of 5 expected facts present”) to track gradual progress.
- Don’t do eval-driven development (writing evals before features) - LLMs have an effectively infinite failure surface that you can’t anticipate up front. Exception: hard known constraints like “never mention competitors”.
- Generic / off-the-shelf metrics waste time and create false confidence. BERTScore, ROUGE, cosine similarity, “helpfulness/coherence/quality” don’t tell you if your system works. Only reuse them as exploration signals to surface interesting traces.
- One “benevolent dictator” domain expert beats a committee - eliminates annotation conflicts. Needing 5 SMEs to judge one interaction means your product scope is too broad. Use Cohen’s Kappa only when multiple annotators are unavoidable.
- Build a custom annotation tool (Cursor/Lovable, hours of work) - teams with one iterate ~10x faster. Render traces in domain-native form (emails as emails, code with syntax highlighting), ask “Was the appointment booked?” not “Did the tool call succeed?”.
- Synthetic data: use structured dimensions and a two-step process. Define dimensions (e.g. dietary restriction x cuisine x complexity), hand-write 20 tuples, then (1) generate more tuples, (2) convert tuples to natural-language queries in a separate prompt. This avoids generic, repetitive output. Synthetic data is unreliable for high-stakes/specialized/low-resource domains.
- LLM-as-judge is expensive - it needs 100+ labeled examples plus weekly upkeep. Reserve it for persistent failures; use cheap regex/assertions/execution tests first. Validate the judge by measuring its true positive rate (TPR) and true negative rate (TNR) on a held-out human-labeled set; using the same model as the main task is usually fine for scoped binary judging.
- Don’t outsource error analysis - it breaks the failure-to-fix feedback loop. Mechanical/context-free tasks (translation, format validation) can be outsourced after the rubric exists.
- LLMs can assist, not replace: good for first-pass axial coding (after you’ve coded 30-50 yourself), mapping annotations to known failure modes, analyzing label patterns. Never delegate initial open coding, taxonomy validation, or ground-truth labeling.
- “RAG is dead” is marketing. The viral claim only targets naive vector retrieval for coding agents. Evaluate retrieval and generation separately (see Jason Liu’s “6 RAG Evals”: C|Q context relevance, A|C faithfulness, A|Q answer relevance).
- Abstention: test “knowing what it doesn’t know” with a balanced set of answerable + unanswerable (false-premise) questions; pass = answers the answerable AND refuses the unanswerable.
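
Two of the heuristics above (the saturation stopping rule and judge validation by TPR/TNR) are simple enough to pin down in code. What follows is a minimal Python sketch, not the authors' implementation. The function names, the input shapes, and the convention that True means "pass" (so "positive" = pass) are all assumptions made for illustration.

```python
def reached_saturation(first_failures, min_reviewed=100, quiet_window=20):
    """Stopping rule for trace review (hypothetical helper).

    first_failures: one entry per reviewed trace, in review order; each entry
    is the failure category assigned to that trace, or None if it had none.
    Returns True once at least `min_reviewed` traces are coded and the most
    recent `quiet_window` traces introduced no category unseen before them.
    """
    if len(first_failures) < min_reviewed:
        return False
    earlier = {c for c in first_failures[:-quiet_window] if c is not None}
    recent = {c for c in first_failures[-quiet_window:] if c is not None}
    return recent <= earlier  # subset: nothing new in the recent window


def judge_tpr_tnr(human_labels, judge_labels):
    """Agreement of an LLM judge with human labels (hypothetical helper).

    Both inputs are equal-length lists of booleans, True = pass.
    Assumption: the human label is ground truth and "positive" means pass.
    Returns (TPR, TNR).
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # human pass, judge pass
    fn = sum(1 for h, j in pairs if h and not j)      # human pass, judge fail
    tn = sum(1 for h, j in pairs if not h and not j)  # human fail, judge fail
    fp = sum(1 for h, j in pairs if not h and j)      # human fail, judge pass
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("held-out set needs both passing and failing examples")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    coded = ["hallucination", "bad format", None, "wrong tool"] * 25  # 100 traces
    print(reached_saturation(coded))

    human = [True, True, True, False, False]
    judge = [True, True, False, False, True]
    print(judge_tpr_tnr(human, judge))
```

Reporting TPR and TNR separately, rather than a single accuracy figure, keeps a judge that passes almost everything from looking good on a held-out set where most examples pass.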