Why evaluate
LLM applications fail in ways traditional software doesn’t. The common failure modes:

| Failure mode | What it looks like |
|---|---|
| Hallucination | The model confidently generates information that isn’t true. |
| Incorrect reasoning | The answer is wrong even though it sounds plausible. |
| Retrieval failure | The system pulls the wrong context from the vector store or knowledge base. |
| Poor tool usage | Wrong tool selected, wrong parameters passed, or a needed tool is missing entirely. |
| Prompt regression | A small prompt change silently degrades quality across the population. |
| Model update drift | A provider upgrade changes behavior in ways tests don’t catch. |

Evaluators make the following possible:
- Safe prompt iteration. Change a prompt and see whether eval scores improved or regressed across thousands of traces, not three you read by hand.
- Model comparison. Try a new model and measure the delta against your existing one on the same population.
- Regression detection. Catch silent quality drops the moment they show up in production.
- Production monitoring. Track quality over time alongside latency and cost.
- Continuous improvement. Feed labeled failures back into prompts, fine-tuning data, and the evaluators themselves.
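
The prompt-iteration, model-comparison, and regression-detection items above all reduce to the same operation: score two variants on the same population of traces and look at the delta. The sketch below shows that comparison using only the Python standard library. The function name `compare_eval_scores`, the `tolerance` parameter, the trace IDs, and the 0-to-1 score scale are all illustrative assumptions, not part of any particular evaluation library.

```python
from statistics import mean


def compare_eval_scores(baseline, candidate, tolerance=0.02):
    """Compare per-trace eval scores for two variants on the same traces.

    `baseline` and `candidate` map a trace ID to a numeric eval score.
    Both the data shape and the `tolerance` default are assumptions
    made for this sketch.
    """
    # Only traces scored under both variants are comparable.
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no traces in common; the comparison needs a shared population")

    base_mean = mean(baseline[t] for t in shared)
    cand_mean = mean(candidate[t] for t in shared)
    delta = cand_mean - base_mean

    return {
        "n_traces": len(shared),
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "delta": delta,
        # Flag a regression only when the drop exceeds the tolerance.
        "regressed": delta < -tolerance,
    }


# Hypothetical scores: 1.0 = evaluator passed the trace, 0.0 = failed.
baseline_scores = {"trace-1": 1.0, "trace-2": 1.0, "trace-3": 0.0, "trace-4": 1.0}
candidate_scores = {"trace-1": 1.0, "trace-2": 0.0, "trace-3": 0.0, "trace-4": 1.0}

report = compare_eval_scores(baseline_scores, candidate_scores)
print(report)
```

In practice the population would be thousands of traces rather than four, and a mean delta is only the simplest possible signal; a per-trace breakdown of which scores flipped is often the more useful thing to read next.
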
The two-cycle improvement loop
The whole point of evaluators is to enable an improvement cycle. There are actually two cycles, not one, and they share data.