Measuring Quality at Scale
Your SupportBot is now instrumented with complete tracing. You can see every LLM call, tool invocation, and retrieval operation. But there's a problem: users are still complaining. Some responses are helpful; others are completely wrong. Visibility alone isn't enough. You need to measure quality. Which traces represent good responses? Which represent failures? And most importantly, how do you identify patterns across thousands of interactions?

This chapter teaches you two approaches to measuring quality:
- Human Annotations - Review traces and add labels directly in the Arize AX UI
- Automated Evaluators - Scale measurement using code-based heuristics or LLM-as-Judge
Follow along with the complete Python notebook.
The Challenge
Manually reviewing traces doesn't scale. If you process 10,000 queries per day, you can't review them all. But you can:
- Review a sample to create ground truth
- Run automated evaluators to identify patterns at scale
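As a rough sketch of the sampling step, assuming your spans have been exported into a pandas DataFrame (the column names here are illustrative, not a fixed export schema):

```python
import pandas as pd

# Hypothetical day of exported spans; in practice this DataFrame comes
# from your observability platform's export API.
spans_df = pd.DataFrame({
    "trace_id": [f"trace-{i}" for i in range(10_000)],
    "output": ["..."] * 10_000,
})

# Review a 1% random sample by hand to build a ground-truth set, and leave
# the remaining 99% to automated evaluators.
review_sample = spans_df.sample(frac=0.01, random_state=42)
print(len(review_sample))  # 100 traces, feasible to annotate manually
```

Fixing `random_state` makes the sample reproducible, so different reviewers annotate the same traces.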
Annotate in the Arize AX UI
The easiest way to start is annotating traces directly in the UI. Navigate to any trace, open the annotation panel, and add labels, scores, or freeform notes. No code required. Arize AX supports three annotation types:
- Categorical - For yes/no or multi-class labels
- Score - For numeric ratings
- Freeform - For open-ended reviewer notes

Automated Evaluations
Manual review doesn't scale. To evaluate thousands of traces automatically, export your spans and run heuristic or LLM-as-Judge evaluators against them.

Tool Result Evaluation
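A minimal sketch of such a heuristic, assuming spans exported to a pandas DataFrame; the column names (`span_kind`, `status_code`) mirror OpenTelemetry conventions but may differ in your export:

```python
import pandas as pd

# Hypothetical exported spans; column names are assumptions for illustration.
spans = pd.DataFrame({
    "span_kind":   ["TOOL", "TOOL", "LLM", "TOOL"],
    "status_code": ["OK", "ERROR", "OK", "OK"],
})

def tool_result_eval(df: pd.DataFrame) -> pd.Series:
    """Label each TOOL span 'correct' if it completed without an error status."""
    tools = df[df["span_kind"] == "TOOL"]
    return (tools["status_code"] == "OK").map({True: "correct", False: "incorrect"})

labels = tool_result_eval(spans)
print(labels.value_counts().to_dict())  # {'correct': 2, 'incorrect': 1}
```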
A simple code-based evaluator can check whether each tool call succeeded; because it needs no LLM calls, it is cheap to run across every trace.

When writing LLM-as-Judge evaluators, wrap the evaluator's LLM calls with suppress_instrumentation() to prevent them from appearing as application traces.

Key Takeaways
You now have two layers of quality measurement:
- Manual annotations via the UI for ground truth
- Automated evaluators for systematic pattern identification at scale
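To complement the heuristic layer, here is a minimal LLM-as-Judge sketch. The judge call is a stub here; in a real evaluator you would replace it with your LLM client and make the call inside suppress_instrumentation(), as noted earlier:

```python
JUDGE_PROMPT = """You are grading a support-bot answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: correct or incorrect."""

def call_judge_llm(prompt: str) -> str:
    """Stand-in for a real LLM client call; returns a canned verdict here.

    In a real evaluator, make this call inside suppress_instrumentation()
    so the judge's own LLM spans do not appear as application traces.
    """
    return "correct"

def judge(question: str, answer: str) -> str:
    raw = call_judge_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = raw.strip().lower()
    # Guard against the judge drifting off-format.
    return verdict if verdict in {"correct", "incorrect"} else "unparseable"

print(judge("How do I reset my password?", "Use the 'Forgot password' link."))
```

Constraining the judge to a fixed label set, and treating anything else as unparseable, keeps downstream aggregation simple.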
What’s Next?
You can now measure quality for individual queries. But what about conversations? Multi-turn interactions need session-level analysis. In the next chapter, Sessions, you'll learn how to:
- Group related traces into conversations
- Track context across multiple turns
- Identify where multi-turn interactions break down