Annotations - Phoenix

Annotations are structured feedback records — labels, scores, and explanations — attached to observability artifacts in Phoenix. They are the primary mechanism for recording quality signals, whether those signals come from end-users, LLM judges, or programmatic checks.

Relevant Source Files

src/types/annotations.ts for the shared Annotation base interface
src/spans/types.ts for SpanAnnotation and DocumentAnnotation
src/sessions/types.ts for SessionAnnotation
src/traces/types.ts for TraceAnnotation

Why Annotations Matter

Annotations close the feedback loop on your LLM application:

Human feedback — Thumbs-up/down from end-users, QA reviews from teammates, or labeling tasks for dataset curation.
LLM-as-judge evaluations — Automated quality scoring using a second LLM (groundedness, helpfulness, safety).
Code-based metrics — Programmatic checks like regex validation, threshold comparisons, or retrieval precision calculations.

Once attached, annotations appear in the Phoenix UI alongside traces and can be used to filter spans, build datasets, and track improvements during experimentation.

Annotation Targets

Phoenix supports four annotation targets, each focused on a different level of your application:

Span Annotations — Feedback on individual traced operations: an LLM call, a tool invocation, a retrieval step. The most common annotation target.
Document Annotations — Feedback on specific retrieved documents within a retriever span, indexed by position. Essential for evaluating RAG pipeline quality.
Session Annotations — Feedback on multi-turn conversations or threads as a whole. Use for conversation-level quality signals like resolution rate or customer satisfaction.
Trace Annotations — Feedback attached to a single trace, identified by its trace ID. Use addTraceAnnotation or logTraceAnnotations from @arizeai/phoenix-client/traces when scoring an end-to-end request. Reach for session annotations instead when scoring a multi-turn conversation as a whole.

Annotator Kinds

Every annotation records who or what produced the feedback:

Kind	Default	Use case
`"HUMAN"`	Yes	Manual review, end-user thumbs-up/down, labeling tasks
`"LLM"`	—	LLM-as-judge evaluations, automated quality scoring
`"CODE"`	—	Programmatic rules, regex checks, threshold-based metrics
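As a rough sketch of the table above, the three kinds can be modelled as a string union with "HUMAN" as the fallback. The `annotatorKind` field name and the `resolveAnnotatorKind` helper are assumptions made for illustration; neither is taken from this page.

```typescript
// The three annotator kinds listed in the table above.
type AnnotatorKind = "HUMAN" | "LLM" | "CODE";

// Assumed field name: this page does not say what the kind field is called.
interface KindedInput {
  annotatorKind?: AnnotatorKind;
}

// Hypothetical helper: fall back to "HUMAN", the documented default.
function resolveAnnotatorKind(input: KindedInput): AnnotatorKind {
  return input.annotatorKind ?? "HUMAN";
}

console.log(resolveAnnotatorKind({}));                        // HUMAN
console.log(resolveAnnotatorKind({ annotatorKind: "LLM" })); // LLM
```

Setting the kind explicitly for LLM-judge and code-based annotations keeps them from being recorded as human feedback by default.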

Shared Annotation Shape

All annotation types share this base interface:

interface Annotation {
  name: string;                        // What is being measured (e.g. "groundedness")
  label?: string;                      // Categorical result (e.g. "grounded")
  score?: number;                      // Numeric result (e.g. 0.95)
  explanation?: string;                // Free-text justification
  identifier?: string;                 // For idempotent upserts
  metadata?: Record<string, unknown>;  // Arbitrary context
}

At least one of label, score, or explanation must be provided. Each target type also adds the field or fields that identify what is being annotated (these are separate from the optional identifier above): spanId for spans, spanId plus documentPosition for documents, traceId for traces, and sessionId for sessions.
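The sketch below shows one way the target-specific shapes could extend the base interface, together with a client-side check for the "at least one of label, score, or explanation" rule. The field types (string IDs, a numeric documentPosition), the `hasResult` helper, and the placeholder IDs are assumptions for illustration; the actual definitions live in the source files listed on this page.

```typescript
// Base shape, copied from the interface above.
interface Annotation {
  name: string;
  label?: string;
  score?: number;
  explanation?: string;
  identifier?: string;
  metadata?: Record<string, unknown>;
}

// Assumed target shapes: each adds the field(s) naming what is annotated.
interface SpanAnnotation extends Annotation { spanId: string; }
interface DocumentAnnotation extends Annotation { spanId: string; documentPosition: number; }
interface TraceAnnotation extends Annotation { traceId: string; }
interface SessionAnnotation extends Annotation { sessionId: string; }

// Hypothetical check: true when at least one result field is present.
// Comparing against undefined means a score of 0 still counts as provided.
function hasResult(a: Annotation): boolean {
  return a.label !== undefined || a.score !== undefined || a.explanation !== undefined;
}

// Placeholder IDs; real values come from your traces.
const grounded: SpanAnnotation = { spanId: "span-123", name: "groundedness", label: "grounded", score: 0.95 };
const relevantDoc: DocumentAnnotation = { spanId: "span-123", documentPosition: 0, name: "relevance", score: 1 };
const resolved: SessionAnnotation = { sessionId: "session-789", name: "resolution", label: "resolved" };
const empty: TraceAnnotation = { traceId: "trace-456", name: "helpfulness" };

const candidates: Annotation[] = [grounded, relevantDoc, resolved, empty];
console.log(candidates.map(hasResult)); // [ true, true, true, false ]
```

Running a check like this before sending a batch lets you catch result-less annotations locally instead of waiting for the server to reject them.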

Sync vs. Async

All annotation write functions accept an optional sync parameter:

sync: false (default) — The server acknowledges receipt and processes the annotation asynchronously. Higher throughput, but the response does not include the annotation ID.
sync: true — The server processes the annotation synchronously and returns its ID. Useful in tests or workflows that need to read the annotation back immediately.
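To make the difference concrete, here is a toy in-memory model of the two modes. It is not the Phoenix server or client: `ToyAnnotationServer`, its methods, and the `ann-N` ID format are all hypothetical, and the real write functions are network calls. The only behaviour it takes from the description above is that the default mode returns no ID while sync mode returns one.

```typescript
// Toy in-memory model of the two write modes (hypothetical, for illustration only).
interface PendingAnnotation { name: string; label?: string; score?: number; }
interface WriteResult { id: string; }

class ToyAnnotationServer {
  private queue: PendingAnnotation[] = [];
  private stored = new Map<string, PendingAnnotation>();
  private nextId = 1;

  // sync defaults to false, mirroring the default described above.
  write(annotation: PendingAnnotation, sync = false): WriteResult | null {
    if (sync) {
      // Processed before returning, so the caller gets the ID.
      return { id: this.process(annotation) };
    }
    // Acknowledged only: queued for later, no ID available yet.
    this.queue.push(annotation);
    return null;
  }

  // Stand-in for the background processing of queued annotations.
  drain(): string[] {
    const ids = this.queue.map((a) => this.process(a));
    this.queue = [];
    return ids;
  }

  read(id: string): PendingAnnotation | undefined {
    return this.stored.get(id);
  }

  private process(annotation: PendingAnnotation): string {
    const id = `ann-${this.nextId++}`;
    this.stored.set(id, annotation);
    return id;
  }
}

const server = new ToyAnnotationServer();
const queued = server.write({ name: "helpfulness", score: 0.8 });
const saved = server.write({ name: "groundedness", label: "grounded" }, true);

console.log(queued);    // null
console.log(saved?.id); // ann-1
```

In this model, only the sync write can be read back right away, which is why sync: true suits tests and read-after-write workflows while the default suits bulk logging.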

Source Map

src/types/annotations.ts
src/spans/addSpanAnnotation.ts
src/spans/logSpanAnnotations.ts
src/spans/addDocumentAnnotation.ts
src/spans/logDocumentAnnotations.ts
src/spans/getSpanAnnotations.ts
src/sessions/addSessionAnnotation.ts
src/sessions/logSessionAnnotations.ts
src/traces/addTraceAnnotation.ts
src/traces/logTraceAnnotations.ts
