Annotate Traces

Your automated evals say a response is “grounded” — but is it really? Sometimes you need a human to weigh in. Annotations let your team add ground-truth labels and scores directly on spans, building a feedback loop between humans and your AI.

How to do it

1. Open a trace and click into any span.
2. Click the Annotate toggle in the span toolbar.
3. Select an annotation config (e.g., “Correctness”, “Helpfulness”) or create a new one.
4. Add your label or score. It saves automatically.

[screenshot: annotation panel open next to a span with label selector]

Span-level and trace-level annotations

Annotations can apply to either a single span or an entire trace. Use a span-level annotation when you are judging one operation, such as whether a retrieved document was relevant. Use a trace-level annotation when the judgment concerns the complete request or conversation, such as end-to-end response quality.
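As a rough illustration of the distinction, the sketch below models an annotation record in Python. The class, its field names, and the example config names ("Document Relevance", "Response Quality") are hypothetical and are not the product's actual schema. The only idea it tries to capture is that a span-level annotation points at one span, while a trace-level annotation points at the trace as a whole.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record (illustrative field names only)."""
    config_name: str
    value: Union[str, float]
    trace_id: str
    # Assumption: a missing span_id means the annotation targets the whole trace.
    span_id: Optional[str] = None

    @property
    def level(self) -> str:
        return "span" if self.span_id is not None else "trace"


annotations = [
    # Judging one operation: was this retrieved document relevant?
    Annotation("Document Relevance", "relevant", trace_id="trace-1", span_id="span-7"),
    # Judging the complete request: end-to-end response quality.
    Annotation("Response Quality", 4, trace_id="trace-1"),
]

levels = [a.level for a in annotations]
```

Here `levels` holds one entry per annotation, which makes it easy to separate per-operation judgments from end-to-end ones when reviewing results.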

Annotation configs

Configs define the schema for your labels. They are shared across the project, so everyone annotates against the same schema.

Categorical — fixed labels (e.g., “correct”, “incorrect”, “partially correct”)
Continuous — numeric scores on a range (e.g., 1–5)

Create new configs on the fly from the annotation panel: click + New Config, choose type, add options, save.
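To make the two config types concrete, here is a minimal Python sketch of what each schema constrains. The class names, fields, and `validate` method are assumptions made for illustration; they do not reflect how the product stores or enforces configs. The label set and the 1–5 range are taken from the examples above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CategoricalConfig:
    """Hypothetical categorical config: a fixed set of allowed labels."""
    name: str
    labels: Tuple[str, ...]

    def validate(self, value: str) -> str:
        if value not in self.labels:
            raise ValueError(f"{value!r} is not a label in config {self.name!r}")
        return value


@dataclass(frozen=True)
class ContinuousConfig:
    """Hypothetical continuous config: a numeric score within a range."""
    name: str
    minimum: float
    maximum: float

    def validate(self, value: float) -> float:
        if not (self.minimum <= value <= self.maximum):
            raise ValueError(
                f"{value} is outside [{self.minimum}, {self.maximum}] "
                f"for config {self.name!r}"
            )
        return float(value)


correctness = CategoricalConfig(
    "Correctness", ("correct", "incorrect", "partially correct")
)
helpfulness = ContinuousConfig("Helpfulness", minimum=1, maximum=5)

label = correctness.validate("partially correct")
score = helpfulness.validate(4)
```

Because every annotator picks from the same labels or range, annotations made under one config stay comparable across the project.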

Annotation notes

In addition to labels and scores, you can attach free-text notes to any annotation. Notes are useful for explaining edge cases, providing context for disagreements, or flagging spans for follow-up discussion.

Measure eval quality with annotations

Use annotations as ground truth to measure how well your automated evals perform:

SELECT
    PRECISION(
        predicted = "eval.Groundedness.label",
        actual = "annotation.Human Groundedness.label",
        pos_class = 'grounded'
    )
FROM model
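For intuition about what this metric reports, the following Python sketch computes precision by hand on a handful of made-up labels. It is a worked example under stated assumptions, not the platform's implementation: the sample data is invented, and the negative label "ungrounded" is an assumed name. Precision here is the fraction of spans the eval called "grounded" that the human annotator also called "grounded".

```python
def precision(predicted, actual, pos_class):
    """Share of predicted-positive items that are also positive in `actual`.

    Returns None when nothing was predicted positive, since precision
    is undefined in that case.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    predicted_pos = sum(1 for p in predicted if p == pos_class)
    if predicted_pos == 0:
        return None
    true_pos = sum(
        1 for p, a in zip(predicted, actual) if p == pos_class and a == pos_class
    )
    return true_pos / predicted_pos


# Invented sample: one entry per span that has both an eval label and a human label.
eval_labels = ["grounded", "grounded", "ungrounded", "grounded", "ungrounded"]
human_labels = ["grounded", "ungrounded", "ungrounded", "grounded", "grounded"]

# The eval said "grounded" three times; the human agreed on two of those.
result = precision(eval_labels, human_labels, pos_class="grounded")
```

In this sample `result` is 2/3: one of the three spans the eval marked as grounded was judged ungrounded by the human, which is the kind of false positive that precision surfaces.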

Annotations vs. evals

            Annotations                  Evals
Who         Humans                       Automated (LLM-as-judge or code)
Scale       Small samples                Every span
Best for    Ground truth, calibration    Production monitoring

Use both: evals for scale, annotations for accuracy.
