Annotation Configs

Use human feedback to curate datasets for testing

Human feedback is often the most nuanced form of evaluation, capturing subtleties that automated methods miss. Even a small number of well-curated annotations can drive meaningful improvements.

Annotations are custom labels that can be added to represent this human feedback. They allow teams and subject matter experts to manually label data and curate high-quality datasets. Users can also log feedback directly using labeling queues or our annotations API.

Why are annotations critical?

Annotations enable deep error analysis, which is the first step toward writing meaningful evals and understanding where performance falls short.

  • A well-annotated dataset is essential for testing and refining eval templates.

  • Annotations also provide a structured way to capture human feedback that can be fed back into prompt optimization and fine-tuning.

  • By creating high-quality labeled data, annotations serve as a reliable ground truth.

For more on annotations, see Hamel’s Evals Blog.


What is an Annotation Config?

Annotation Configs allow you to define consistent annotation schemas that can be reused across your workspace, ensuring evaluations are structured and comparable over time.

To create a new annotation config, navigate to Annotation Configs in the sidebar and click New Annotation Config. You’ll then define four key elements:

  • Annotation Name: Provide a clear, descriptive name for your annotation. This helps others identify its purpose (e.g., Correctness or Response Helpfulness).

  • Annotation Config Type: Choose how you want to capture feedback:

    • Categorical Options – Assign predefined labels (e.g., Correct / Incorrect, Helpful / Unhelpful).

    • Continuous Score – Apply a numeric score or range to quantify performance (e.g., 0–1 for relevance).

    • Freeform Text – Enter open-ended feedback for qualitative evaluations.

  • Optimization Direction: Specify how the annotation is evaluated: Maximize if higher scores are better, or Minimize if lower scores are better.

  • Define Labels or Scores: Depending on your selected type, define the label categories or scoring range. For example: Correct (score = 1) and Incorrect (score = 0), as sketched below.
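
To make the mapping concrete, here is a minimal sketch of how the Correctness example above fits these four elements together. It is purely illustrative, a plain Python dict rather than an Arize API object; the config itself is created in the UI as described above.

# Illustrative only: the "Correctness" categorical config expressed as a plain dict.
# This is not an Arize API object; annotation configs are created in the UI.
correctness_config = {
    "name": "Correctness",                     # Annotation Name
    "type": "categorical",                     # Annotation Config Type
    "optimization_direction": "maximize",      # Higher scores are better
    "labels": {"Correct": 1, "Incorrect": 0},  # Label categories and their scores
}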

Add Annotations in the UI

Traces

Annotations can be applied at the span level for LLM use cases. Within a span, click the annotation icon to add feedback. From there, you can choose an existing annotation config or create a new one.

Experiments

You can annotate experiment results in Arize to capture human feedback. As you iterate and make system changes, this feedback serves as a strong signal for identifying improvements or regressions.

Create Annotations via API

Annotations can also be logged via our Python SDK using the log_annotations function to attach human feedback to spans.

Logging the annotation

Important Prerequisite: Before logging annotations using the SDK, you must first configure the annotation within the Arize AX UI.

Import Packages and Set Up the Arize Client

import os
import pandas as pd
from arize.pandas.logger import Client

API_KEY = os.environ.get("ARIZE_API_KEY")  # You can get this from the UI
SPACE_ID = os.environ.get("ARIZE_SPACE_ID")  # You can get this from the UI
DEVELOPER_KEY = os.environ.get("ARIZE_DEVELOPER_KEY")  # Needed for sync functions
PROJECT_NAME = "YOUR_PROJECT_NAME" # Replace with your project name

arize_client = Client(
    space_id=SPACE_ID,
    api_key=API_KEY,
    developer_key=DEVELOPER_KEY,
)

response = arize_client.log_annotations(
    dataframe=annotations_dataframe,
    project_name=PROJECT_NAME,
    validate=True,  # Keep validation enabled
    verbose=True,   # Enable detailed SDK logs, especially when first trying
)
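
The call returns a response object you can inspect. The following is a hedged sketch that assumes log_annotations returns a requests-style response with status_code and text attributes (an assumption here, not confirmed above):

# Assumption: the response exposes requests-style status_code / text attributes.
if response.status_code == 200:
    print(f"Successfully logged annotations to project '{PROJECT_NAME}'")
else:
    print(f"Logging failed with status {response.status_code}: {response.text}")
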
Annotations Dataframe Schema

The annotations_dataframe requires the following columns:

  1. context.span_id: The unique identifier of the span to which the annotations should be attached.

  2. Annotation Columns: Columns following the pattern annotation.<annotation_name>.<suffix>:

  • <annotation_name>: A name for your annotation (e.g., quality, correctness, sentiment). Use only alphanumeric characters and underscores.

  • <suffix>: Defines the type and metadata of the annotation. Valid suffixes are:

    • label: For categorical annotations (e.g., "good", "bad", "spam"). The value should be a string.

    • score: For numerical annotations (e.g., a rating from 1-5). The value should be numeric (int or float).

    • updated_by (Optional): A string indicating who made the annotation (e.g., "user_id_123", "annotator_team_a"). If not provided, the SDK automatically sets this to "SDK Logger".

    • updated_at (Optional): A timestamp indicating when the annotation was made, represented as milliseconds since the Unix epoch (integer). If not provided, the SDK automatically sets this to the current UTC time.

  • You must provide at least one annotation.<annotation_name>.label or annotation.<annotation_name>.score column for each annotation you want to log.

  3. annotation.notes (Optional): A column containing free-form text notes that apply to the entire span, not a specific annotation label/score. The value should be a string.

An example annotation data dictionary would look like:

# Assume TARGET_SPAN_ID holds the ID of the span you want to annotate
TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543"

annotation_data = {
    "context.span_id": [TARGET_SPAN_ID],
    # Annotation 1: Categorical label, let SDK autogenerate updated_by/updated_at
    "annotation.quality.label": ["good"],
    # Annotation 2: Categorical label, manually set updated_by
    "annotation.relevance.label": ["relevant"],
    "annotation.relevance.updated_by": ["human_annotator_1"],
    # Annotation 3: Numerical score, let SDK autogenerate updated_by/updated_at
    "annotation.sentiment_score.score": [4.5],
    # Optional notes for the span
    "annotation.notes": ["User confirmed the summary was helpful."],
}
annotations_dataframe = pd.DataFrame(annotation_data)
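
Building on the single-span example, the sketch below assembles a multi-row dataframe that annotates two spans at once and sets the optional updated_by and updated_at columns explicitly (updated_at in milliseconds since the Unix epoch). The span IDs and values are hypothetical placeholders; the client and pandas import come from the setup above.

import time

# Hypothetical span IDs used only for illustration
span_ids = [
    "3461a49d-e0c3-469a-837b-d83f4a606543",
    "8f2c1b7e-0000-4c1d-9e2f-0123456789ab",
]

now_ms = int(time.time() * 1000)  # milliseconds since the Unix epoch

batch_annotation_data = {
    "context.span_id": span_ids,
    # Categorical label per span, with explicit annotator and timestamp
    "annotation.quality.label": ["good", "bad"],
    "annotation.quality.updated_by": ["human_annotator_1", "human_annotator_2"],
    "annotation.quality.updated_at": [now_ms, now_ms],
    # Numerical score per span
    "annotation.sentiment_score.score": [4.5, 1.0],
    # Optional span-level notes
    "annotation.notes": ["Helpful summary.", "Missed the key point."],
}
batch_annotations_dataframe = pd.DataFrame(batch_annotation_data)

# Log the batch with the same call as above
response = arize_client.log_annotations(
    dataframe=batch_annotations_dataframe,
    project_name=PROJECT_NAME,
    validate=True,
    verbose=True,
)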
