Begin with human judgment

Automated evals are only as good as your understanding of what actually matters. Start by reviewing real interactions in your tracing project, identifying failure patterns, and grouping them into a taxonomy. The labels you collect become ground truth, and that process tells you which evals to build. When you are ready to automate, see create evaluators.
[Screenshot: the Annotation Configs page, listing reusable configs with names, label values, creator, timestamps, and tags, plus a New Annotation Config button in the header]

What is an annotation

An annotation is a human label attached to a span, dataset example, or experiment result. It can be a category (e.g. Correct / Incorrect), a numeric score (e.g. 0-1), or freeform text feedback. Annotation configs define reusable schemas for these labels, keeping evaluations consistent and comparable over time. To add your first annotation config, navigate to Annotation Configs in the left navigation and click New Annotation Config. You’ll define:
  • Name: a clear label for the annotation (e.g. “Correctness”)
  • Type: categorical, numeric score, or freeform text
  • Optimization direction: Set to maximize if a higher score is better (e.g. accuracy), or minimize if a lower score is better (e.g. error rate). This determines how scores are color-coded in the UI.
  • Labels and score range: e.g. Correct (1) / Incorrect (0)
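The fields above can be modeled as a small schema. The sketch below is illustrative only: the `AnnotationConfig` class and its field names are hypothetical and not part of the Arize SDK; they simply mirror what the New Annotation Config form asks for.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class AnnotationType(Enum):
    CATEGORICAL = "categorical"  # fixed set of labels, e.g. Correct / Incorrect
    SCORE = "score"              # numeric score within a range
    FREEFORM = "freeform"        # free text feedback

class OptimizationDirection(Enum):
    MAXIMIZE = "maximize"  # higher is better, e.g. accuracy
    MINIMIZE = "minimize"  # lower is better, e.g. error rate

@dataclass
class AnnotationConfig:
    """Reusable schema for human labels (hypothetical model, not the Arize API)."""
    name: str
    type: AnnotationType
    direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
    labels: dict[str, float] = field(default_factory=dict)  # label -> score (categorical)
    score_range: Optional[tuple[float, float]] = None       # (min, max) for numeric scores

# A "Correctness" config matching the example above.
correctness = AnnotationConfig(
    name="Correctness",
    type=AnnotationType.CATEGORICAL,
    labels={"correct": 1.0, "incorrect": 0.0},
)
```

Keeping label-to-score mappings in one place like this is what makes annotations comparable across reviewers and over time.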

Annotate your spans

There are several ways to review and annotate your spans.
Use the Arize skills plugin in your coding agent to manage annotation configs and apply annotations without leaving your editor. See the full arize-annotation skill documentation for supported commands. Then ask your agent:
  • “Create a categorical annotation config called Correctness with correct/incorrect labels”
  • “List all annotation configs in my space”
  • “Bulk annotate these spans with their correctness labels”
[Screenshot: terminal output from the arize-annotation skill, showing ax annotation-configs create and list commands with table and JSON output, and a summary table of annotation config names, types, labels, and IDs]
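Before bulk-applying labels, it helps to validate each one against its config so inconsistent labels never reach your ground truth. The snippet below is a standalone sketch of that check; the dict shapes and the `validate_annotation` helper are hypothetical, not part of the Arize API.

```python
# A minimal categorical config: label -> numeric score.
CORRECTNESS_CONFIG = {
    "name": "Correctness",
    "labels": {"correct": 1.0, "incorrect": 0.0},
}

def validate_annotation(annotation: dict, config: dict) -> float:
    """Return the numeric score for a label, raising if the label is unknown."""
    label = annotation["label"]
    if label not in config["labels"]:
        raise ValueError(f"{label!r} is not a valid {config['name']} label")
    return config["labels"][label]

# One record per span, e.g. collected during a review session.
annotations = [
    {"span_id": "span-001", "label": "correct"},
    {"span_id": "span-002", "label": "incorrect"},
]

# Map each span to its validated score before submitting the batch.
scores = {a["span_id"]: validate_annotation(a, CORRECTNESS_CONFIG) for a in annotations}
```

Rejecting unknown labels up front keeps the resulting dataset usable as ground truth for the evaluators you build later.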
For routed review workflows and curating labeled examples into a benchmark dataset, see Labeling Queues.

What’s next

To automate quality checks, create evaluators. If you’d prefer additional human review at scale, see create a labeling queue.

Further reading