# LLM Evals: Everything You Need to Know

> A sharp-opinion FAQ on LLM evals: skip generic metrics and tooling, do error analysis on your own traces, build a failure taxonomy, then write targeted evaluators.

Published: 2026-06-21
URL: https://daniliants.com/insights/llm-evals-everything-you-need-to-know/
Tags: llm-evals, error-analysis, llm-as-judge, rag

---

## Summary

A sharp-opinion FAQ distilled from teaching LLM evals to 700+ engineers and PMs. The core thesis: stop buying generic eval metrics and tooling, and instead do error analysis - manually read your own traces, build a failure taxonomy, and only then write targeted evaluators. Evaluation is a human-driven sensemaking process, not a dashboard you switch on.

## Key Insight

- **Error analysis is the whole game.** On real projects the authors spent 60-80% of dev time on error analysis and evaluation, with the effort going into understanding failures rather than building automated checks. Skipping error analysis produces "counter-productive generic metrics".
- **Read ~100 traces, stop at saturation.** Heuristic: if ~20 new traces turn up no new failure category, you can stop (but review at least 100 first). The process is open coding (write a journal note on the first failure you see in each trace), then axial coding (group the notes into a failure taxonomy and count each category).
- **A 100% eval pass rate is a red flag** - you are not stress-testing. A 70% pass rate often signals a more meaningful eval.
- **Binary pass/fail beats 1-5 Likert.** Likert points (3 vs 4) are subjective, need bigger samples, and annotators hide uncertainty in the middle. Decompose into sub-component binary checks (e.g. "4 of 5 expected facts present") to track gradual progress.
- **Don't do eval-driven development** (writing evals before features) - LLMs have an effectively infinite failure surface that you can't anticipate. Exception: hard known constraints like "never mention competitors".
- **Generic / off-the-shelf metrics waste time and create false confidence.** BERTScore, ROUGE, cosine similarity, "helpfulness/coherence/quality" don't tell you if your system works. Only reuse them as exploration signals to surface interesting traces.
- **One "benevolent dictator" domain expert** beats a committee - eliminates annotation conflicts. Needing 5 SMEs to judge one interaction means your product scope is too broad. Use Cohen's Kappa only when multiple annotators are unavoidable.
- **Build a custom annotation tool** (Cursor/Lovable, hours of work) - teams with one iterate ~10x faster. Render traces in domain-native form (emails as emails, code with syntax highlighting), ask "Was the appointment booked?" not "Did the tool call succeed?".
- **Synthetic data: use structured dimensions and a two-step process.** Define dimensions (e.g. dietary restriction x cuisine x complexity), hand-write 20 tuples, then (1) generate more tuples, (2) convert tuples to natural-language queries in a separate prompt. This avoids generic, repetitive output. Synthetic data is unreliable for high-stakes, specialized, or low-resource domains.
- **LLM-as-judge is expensive** - needs 100+ labeled examples plus weekly upkeep. Reserve it for persistent failures; use cheap regex/assertions/execution tests first. Validate the judge by measuring its true positive rate (TPR) and true negative rate (TNR) against a held-out human-labeled set; using the same model as the main task is usually fine for scoped binary judging.
- **Don't outsource error analysis** - it breaks the failure-to-fix feedback loop. Mechanical/context-free tasks (translation, format validation) can be outsourced after the rubric exists.
- **LLMs can assist, not replace:** good for first-pass axial coding (after you've coded 30-50 yourself), mapping annotations to known failure modes, analyzing label patterns. Never delegate initial open coding, taxonomy validation, or ground-truth labeling.
- **"RAG is dead" is marketing.** The viral claim only targets naive vector retrieval for coding agents. Evaluate retrieval and generation separately (see Jason Liu's "6 RAG Evals": C|Q context relevance, A|C faithfulness, A|Q answer relevance).
- **Abstention:** test "knowing what it doesn't know" with a balanced set of answerable + unanswerable (false-premise) questions; pass = answers the answerable AND refuses the unanswerable.
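
The saturation heuristic from the "Read ~100 traces" point can be written down as a small check. This is a minimal sketch, not the authors' tooling: the function name `reached_saturation`, its parameters, and the representation of each reviewed trace as a set of failure-category labels are all assumptions made for illustration.

```python
def reached_saturation(trace_categories, min_traces=100, window=20):
    """Return True when reviewing can stop under the saturation heuristic.

    trace_categories: list of sets, one per reviewed trace in review order,
    holding the failure categories assigned to that trace (hypothetical format).
    """
    # Always review a minimum number of traces first.
    if len(trace_categories) < min_traces:
        return False
    # Categories already known before the most recent `window` traces.
    seen_before = set().union(*trace_categories[:-window])
    # Categories that appear in the most recent `window` traces.
    recent = set().union(*trace_categories[-window:])
    # Saturated if the recent traces surfaced nothing new.
    return recent <= seen_before
```

Calling it after each review session with the running list of labels gives a yes/no answer; the thresholds are the rough numbers from the heuristic, not hard limits.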
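
For the synthetic-data point, the tuple-then-query structure might look like the sketch below. Everything concrete here is an assumption: the dimension values, the recipe-assistant framing, and the helper names `sample_tuples` and `query_prompt` are invented for illustration. Sampling from the cross product is only a stand-in for step 1, and step 2 stops at building the prompt text because the choice of LLM client is left open.

```python
import itertools
import random

# Illustrative dimension values (assumed, not taken from the source).
DIMENSIONS = {
    "dietary_restriction": ["vegan", "gluten-free", "none"],
    "cuisine": ["Italian", "Thai", "Mexican"],
    "complexity": ["quick weeknight", "multi-course"],
}


def sample_tuples(dimensions, n, seed=0):
    """Step 1 stand-in: draw n distinct tuples from the cross product."""
    names = list(dimensions)
    combos = list(itertools.product(*(dimensions[name] for name in names)))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    chosen = rng.sample(combos, min(n, len(combos)))
    return [dict(zip(names, combo)) for combo in chosen]


def query_prompt(tup):
    """Step 2: build a separate prompt that turns one tuple into a user query."""
    attributes = "\n".join(f"- {key}: {value}" for key, value in tup.items())
    return (
        "Write one realistic user message to a recipe assistant "
        "that matches these attributes:\n" + attributes
    )
```

Keeping the two steps apart means the tuples can be inspected and deduplicated before any natural-language text is generated.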
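
Validating a binary judge by TPR and TNR, as the LLM-as-judge point describes, comes down to counting agreement on each class separately. A minimal sketch, assuming labels are booleans where `True` means "pass" and treating "pass" as the positive class; the function name `judge_rates` is hypothetical.

```python
def judge_rates(human_labels, judge_labels):
    """Compute the judge's TPR and TNR against human ground-truth labels.

    Both inputs are equal-length lists of booleans (True = pass, assumed).
    Returns None for a rate when that class is absent from the human labels.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    tpr = tp / (tp + fn) if (tp + fn) else None  # share of human passes the judge passes
    tnr = tn / (tn + fp) if (tn + fp) else None  # share of human fails the judge fails
    return {"tpr": tpr, "tnr": tnr}
```

Reporting the two rates separately matters because a judge that passes everything scores a perfect TPR while its TNR is zero, which a single accuracy number can hide on an imbalanced set.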
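
The abstention criterion can be scored with a small helper. This is a sketch under stated assumptions: each case is reduced to a hypothetical `(answerable, refused)` pair of booleans, and an answerable case counts as passed as soon as the system did not refuse, so answer correctness would need its own check.

```python
def abstention_report(cases):
    """Score a balanced abstention set.

    cases: list of (answerable, refused) boolean pairs (assumed format).
    A case passes if an answerable question was answered or an
    unanswerable (false-premise) question was refused.
    """
    answerable = [not refused for is_answerable, refused in cases if is_answerable]
    unanswerable = [refused for is_answerable, refused in cases if not is_answerable]

    def rate(results):
        return sum(results) / len(results) if results else None

    return {
        "answerable_pass_rate": rate(answerable),
        "unanswerable_pass_rate": rate(unanswerable),
        # Overall pass requires both groups to be present and fully passed.
        "overall_pass": bool(answerable) and bool(unanswerable)
        and all(answerable) and all(unanswerable),
    }
```

Splitting the report by group shows which way the system errs: refusing too often lowers the first rate, while answering false-premise questions lowers the second.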