Human Annotations

Use human feedback to curate datasets for testing

Annotations are custom labels that can be added to traces in LLM applications. AI engineers can use annotations to manually label data, curate datasets for experimentation, and capture human feedback as an alternative to code-based or LLM evals.

Annotations are great for finding examples where LLM evals and humans disagree for further review. Oftentimes, subject matter experts (e.g. a doctor or lawyer) are needed to determine the correctness of an answer. Customers also log feedback directly using labeling queues or our annotations API.

Adding an annotation to a span

Key features

Add annotations in the UI

Annotations are labels that can be applied at a per-span level for LLM use cases. Each annotation is defined by a config that specifies its type (label or score). Once defined, that annotation is available for any future annotations in the project.

Unstructured text annotations (notes) can also be continuously added.

Viewing Annotations on a Trace

Users can save and view annotations on a trace and also filter on them.

Create Annotations via API

Annotations can also be logged programmatically via our Python SDK. Use the log_annotations function to attach human feedback, corrections, or other annotations to specific spans. The code below assumes that you have annotation data available in a pandas DataFrame (annotations_df in the examples below) and that you have the relevant context.span_id for the span you want to annotate.

Note: Annotations can be applied on spans up to 14 days prior to the current day. To apply annotations beyond this lookback window, please reach out to support@arize.com
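
If you are backfilling older data, it can help to check whether a span's start time still falls inside this 14-day window before attempting to log. Below is a minimal sketch, assuming you already have the span's start time as a timezone-aware datetime (the value shown is a placeholder):

from datetime import datetime, timedelta, timezone

# Assumption: span_start_time is the span's start timestamp as a timezone-aware datetime
span_start_time = datetime(2024, 6, 1, tzinfo=timezone.utc)  # placeholder value

# Spans older than 14 days fall outside the default annotation lookback window
within_lookback = span_start_time >= datetime.now(timezone.utc) - timedelta(days=14)
print("Within annotation lookback window:", within_lookback)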

Logging the annotation

Important Prerequisite: Before logging annotations using the SDK, you must first configure the annotation definition within the Arize application UI.

  1. Navigate to a trace within your project in the Arize platform.

  2. Click the "Annotate" button to open the annotation panel.

  3. Click "Add Annotation".

  4. Define the <annotation_name> exactly as you will use it in your SDK code.

  5. Select the appropriate Type (Label for categorical strings, Score for numerical values).

  6. If using Label, define the specific allowed label strings that the SDK can send for this annotation name. If using Score, you can optionally define score ranges.

Note: Only annotations matching a pre-configured (1) name and (2) type/labels in the UI can be successfully logged via the SDK.

Here is how you can log your annotations in real time to the Arize platform with the Python SDK:

Import Packages and Setup Arize Client

import os
import pandas as pd
from arize.pandas.logger import Client

API_KEY = os.environ.get("ARIZE_API_KEY") # You can get this from the UI
SPACE_ID = os.environ.get("ARIZE_SPACE_ID") # You can get this from the UI
DEVELOPER_KEY = os.environ.get("ARIZE_DEVELOPER_KEY") # Needed for sync functions
PROJECT_NAME = "YOUR_PROJECT_NAME" # Replace with your project name

print(f"\n🚀 Initializing Arize client for space '{SPACE_ID}'...")
try:
    arize_client = Client(
        space_id=SPACE_ID,
        api_key=API_KEY,
        developer_key=DEVELOPER_KEY
    )
    print("✅ Arize client initialized.")
except Exception as e:
    print(f"❌ Error initializing client: {e}")
    exit()

Create Sample Data (replace with your actual data)

TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543" # Replace with your span ID
annotation_data = {
    "context.span_id": [TARGET_SPAN_ID],
    "annotation.quality.label": ["good"],
    "annotation.relevance.label": ["relevant"],
    "annotation.relevance.updated_by": ["human_annotator_1"],
    "annotation.sentiment_score.score": [4.5],
    "annotation.notes": ["User confirmed the summary was helpful."],
}
annotations_df = pd.DataFrame(annotation_data)

Log Annotation

try:
    response = arize_client.log_annotations(
        dataframe=annotations_df,
        project_name=PROJECT_NAME,
        validate=True,  # Keep validation enabled
        verbose=True    # Enable detailed SDK logs, especially when first trying
    )

    if response:
        print("\n✅ Successfully logged annotations!")
        print(f"   Annotation Records Updated: {response.records_updated}")
    else:
        print("\n⚠️ Annotation logging call completed, but no response received (check SDK logs/platform).")

except Exception as e:
    print(f"\n❌ An error occurred during annotation logging: {e}") 

Annotations DataFrame Schema

The annotations DataFrame requires the following columns:

  1. context.span_id: The unique identifier of the span to which the annotations should be attached.

  2. Annotation Columns: Columns following the pattern annotation.<annotation_name>.<suffix> where:

  • <annotation_name>: A name for your annotation (e.g., quality, correctness, sentiment). Should be alphanumeric characters and underscores.

  • <suffix>: Defines the type and metadata of the annotation. You must provide at least one annotation.<annotation_name>.label or annotation.<annotation_name>.score column for each annotation you want to log. Valid suffixes are:

    • label: For categorical annotations (e.g., "good", "bad", "spam"). The value should be a string.

    • score: For numerical annotations (e.g., a rating from 1-5). The value should be numeric (int or float).

    • updated_by (Optional): A string indicating who made the annotation (e.g., "user_id_123", "annotator_team_a"). If not provided, the SDK automatically sets this to "SDK Logger".

    • updated_at (Optional): A timestamp indicating when the annotation was made, represented as milliseconds since the Unix epoch (integer). If not provided, the SDK automatically sets this to the current UTC time. A sketch that sets updated_by and updated_at explicitly follows the example below.

  • annotation.notes (Optional): A column containing free-form text notes that apply to the entire span rather than to a specific annotation label/score. The value should be a string. The SDK will handle formatting this correctly for storage.

An example annotation data dictionary would look like:

# Assume TARGET_SPAN_ID holds the ID of the span you want to annotate
TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543"

annotation_data = {
    "context.span_id": [TARGET_SPAN_ID],
    # Annotation 1: Categorical label, let SDK autogenerate updated_by/updated_at
    "annotation.quality.label": ["good"],
    # Annotation 2: Categorical label, manually set updated_by
    "annotation.relevance.label": ["relevant"],
    "annotation.relevance.updated_by": ["human_annotator_1"],
    # Annotation 3: Numerical score, let SDK autogenerate updated_by/updated_at
    "annotation.sentiment_score.score": [4.5],
    # Optional notes for the span
    "annotation.notes": ["User confirmed the summary was helpful."],
}
annotations_df = pd.DataFrame(annotation_data)
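
If you want to set updated_by and updated_at yourself (for example, when backfilling feedback collected outside of Arize), the timestamps must be integers in milliseconds since the Unix epoch. Below is a minimal sketch under those assumptions: the span IDs and reviewer names are placeholders, and the annotation names must already be configured in the UI as described above.

import time
import pandas as pd

# Placeholder span IDs -- replace with real context.span_id values from your traces
span_ids = [
    "3461a49d-e0c3-469a-837b-d83f4a606543",
    "7f2b1c0e-9a4d-4f6b-8c21-0d5e9a7b3c11",
]

# Current time as integer milliseconds since the Unix epoch
now_ms = int(time.time() * 1000)

backfill_df = pd.DataFrame({
    "context.span_id": span_ids,
    # Categorical label with an explicit reviewer and timestamp
    "annotation.quality.label": ["good", "bad"],
    "annotation.quality.updated_by": ["reviewer_alice", "reviewer_bob"],
    "annotation.quality.updated_at": [now_ms, now_ms],
    # Numerical score, letting the SDK fill in updated_by/updated_at automatically
    "annotation.sentiment_score.score": [4.0, 2.5],
})

# Log with the same client and call shown above
response = arize_client.log_annotations(
    dataframe=backfill_df,
    project_name=PROJECT_NAME,
    validate=True,
)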

Annotations API

Labeling Queues

Labeling queues are sets of data you would like subject matter experts or third parties to label or score on any criteria you specify. You can use these annotations to create golden datasets from experts for fine-tuning, and to find examples where LLM evals and humans disagree.

To use labeling queues, you need:

  1. A dataset you want to annotate

  2. Annotator users in your space

    1. Note: you can assign annotators OR members in your space to a labeling queue. Annotators will see a restricted view of the platform (see below)

  3. Annotation criteria

Inviting an Annotator

On the settings page, you can invite annotators by adding them as users with the Annotator account role. They will receive an email inviting them to your space, where they can set their password.

Creating a Labeling Queue

After you have created a dataset of traces you want to evaluate, you can create a labeling queue and distribute it to your annotation team. Then, you can view the records and the annotations they provide.

The columns that annotators label will appear on datasets as namespaced annotation columns (e.g. annotation.hallucination). The latest annotation value for a specific row will be namespaced with latest.userannotation, which can be helpful to use in experiments if multiple annotators are labeling a dataset.
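
For example, once an annotated dataset has been exported to a pandas DataFrame (e.g. via a CSV export), you can surface the rows where humans and LLM evals disagree. The sketch below assumes hypothetical column names -- annotation.hallucination for the human label and eval.hallucination.label for the LLM eval -- so substitute the actual columns from your dataset.

import pandas as pd

# Assumption: dataset rows exported to a CSV with one row per example
dataset_df = pd.read_csv("annotated_dataset.csv")

human_col = "annotation.hallucination"  # human annotation column (assumed name)
eval_col = "eval.hallucination.label"   # LLM eval column (assumed name)

# Keep only rows where both values exist and the human label differs from the eval label
disagreements = dataset_df[
    dataset_df[human_col].notna()
    & dataset_df[eval_col].notna()
    & (dataset_df[human_col] != dataset_df[eval_col])
]

print(f"{len(disagreements)} rows where human annotations and LLM evals disagree")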

Labeling data as an annotator

Annotators see the labeling queues they have been assigned, and the data they need to annotate, along with the label or score they need to provide in the top right. Your datasets can contain text, images, and links. Annotators can leave notes, and use the keyboard shortcuts to provide annotations faster.
