
# Human review

> Annotation configs, labeling workflows, and structured feedback on traces and datasets—turn labels into ground truth for evals.

## Begin with human judgment

Automated evals are only as good as your understanding of what actually matters. Start by reviewing real interactions in your [tracing project](/ax/observe/tracing), identifying failure patterns, and grouping them into a taxonomy. The labels you collect become ground truth, and that process tells you which evals to build. When you are ready to automate, see [create evaluators](/ax/evaluate/create-evaluators).

<Frame caption="Annotation Configs">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/human%20judgement.png" alt="Arize AX Annotation Configs page with a table of reusable configs showing names, label values as colored pills, created by, timestamps, tags, and New Annotation Config in the header" />
</Frame>

## What is an annotation

An annotation is a human label attached to a span, dataset example, or experiment result. It can be a category (e.g. Correct / Incorrect), a numeric score (e.g. 0-1), or freeform text feedback. Annotation configs define reusable schemas for these labels, keeping evaluations consistent and comparable over time.

To add your first annotation config, navigate to **Annotation Configs** in the left navigation and click **New Annotation Config**. You'll define:

* **Name:** a clear label for the annotation (e.g. "Correctness")
* **Type:** categorical, numeric score, or freeform text
* **Optimization direction:** Set to **maximize** if a higher score is better (e.g. accuracy), or **minimize** if a lower score is better (e.g. error rate). This determines how scores are color-coded in the UI.
* **Labels and score range:** e.g. Correct (1) / Incorrect (0)

<Frame>
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/annotation_config.png" alt="Annotation Config" />
</Frame>

<h2 id="annotate-your-spans">
  Annotate your spans
</h2>

There are several ways to review and annotate your spans.

<Tabs>
  <Tab title="By Arize Skills">
    Use the [Arize skills plugin](/ax/agents/arize-skills) in your coding agent to manage annotation configs and apply annotations without leaving your editor. The [arize-annotation skill documentation](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-annotation/SKILL.md) lists the supported commands. Example prompts for your agent:

    * "Create a categorical annotation config called Correctness with correct/incorrect labels"
    * "List all annotation configs in my space"
    * "Bulk annotate these spans with their correctness labels"

    ![Coding agent terminal using the Arize skills plugin to create annotation configs with the ax CLI](https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/config-new.png)
  </Tab>

  <Tab title="By Alyx">
    Use [Alyx](/ax/alyx/meet-alyx) to help you find common error patterns across your traces. From there you can ask Alyx to annotate spans directly:

    * "Show me the most common failure patterns in my traces"
    * "Create an annotation config capturing good and bad responses"
    * "Annotate spans where the output looks incorrect — a good response is factually accurate and directly answers the user's question, a bad response is vague, hallucinated, or off-topic"

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/config_alyx.png" alt="Eval Traces view with Alyx in the side panel analyzing errors and proposing an annotation config to label data" />
    </Frame>
  </Tab>

  <Tab title="By UI">
    Open the [Spans](/ax/observe/tracing/spans) view and review real outputs. Optionally, use filters to focus on a specific span kind, time range, or status. To annotate a span, click the annotate button and select your config.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/trace_config.png" alt="Trace detail view with span input and output and the Annotations panel open to select correctness labels" />
    </Frame>
  </Tab>

  <Tab title="By Code">
    Apply annotations via the Python SDK to attach human feedback programmatically.

    <Danger>
      Annotations can be applied to spans up to 31 days before the current day. To annotate spans outside this lookback window, contact [support@arize.com](mailto:support@arize.com).
    </Danger>
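
    To check the lookback window before logging, something like the following could flag spans that are too old. This is a sketch, not part of the Arize SDK: `within_lookback` is a hypothetical helper, it assumes you have each span's start time as a timezone-aware `datetime`, and it approximates the window as 31 full days before `now`, which may differ from how the platform counts days.

    ```python
    from datetime import datetime, timedelta, timezone

    # 31 days, taken from the note above.
    LOOKBACK_DAYS = 31


    def within_lookback(span_start, now=None, days=LOOKBACK_DAYS):
        """Return True if span_start is no more than `days` days before `now`.

        Hypothetical helper: both datetimes must be timezone-aware.
        """
        if now is None:
            now = datetime.now(timezone.utc)
        return now - span_start <= timedelta(days=days)
    ```

    Spans for which this returns `False` would need to go through support rather than the SDK call.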

    The following sample DataFrame holds the annotations to log:

    ```python theme={null}
    import pandas as pd

    # Sample annotation df with multiple annotations
    annotations_dataframe = pd.DataFrame({
        "context.span_id": [
            "12345",
            "67890",
        ],
        # Categorical annotation: quality
        "annotation.quality.label": ["good", "excellent"],
        "annotation.quality.updated_by": ["annotator_1", "annotator_2"],

        # Optional notes for each span
        "annotation.notes": [
            "User confirmed the summary was helpful.",
            "Response was clear and accurate.",
        ],
    })
    ```

    <CodeGroup>
      ```python Python SDK v8 theme={null}
      from arize import ArizeClient

      client = ArizeClient(api_key="your-arize-api-key")

      response = client.spans.update_annotations(
          space_id="your-arize-space-id",
          project_name="your-project-name",
          dataframe=annotations_dataframe,
          validate=True,
      )
      ```

      ```python Python SDK v7 theme={null}
      from arize.pandas.logger import Client

      arize_client = Client(
          space_id="your-arize-space-id",
          api_key="your-arize-api-key",
      )

      response = arize_client.log_annotations(
          dataframe=annotations_dataframe,
          project_name="your-project-name",
          validate=True,
      )
      ```
    </CodeGroup>

    <span id="annotations-dataframe-schema" />

    <Accordion title="Annotations Dataframe Schema">
      The `annotations_dataframe` uses the following columns:

      1. `context.span_id` (Required): The unique identifier of the span to which the annotations should be attached.
      2. `annotation.NAME.SUFFIX` columns, where **NAME** is your annotation key (for example `quality`, `correctness`, or `sentiment`) using only letters, numbers, and underscores, and **SUFFIX** defines the type and metadata of the annotation. You must provide at least one `annotation.NAME.label` or `annotation.NAME.score` column for each annotation you want to log. Valid suffixes are:
         * `label`: For categorical annotations (for example, `good`, `bad`, `spam`). The value should be a string.
         * `score`: For numerical annotations (for example, a rating from 1–5). The value should be numeric (int or float).
         * `updated_by` (Optional): A string indicating who made the annotation (for example, `user_id_123` or `annotator_team_a`). If not provided, the SDK automatically sets this to `SDK Logger`.
         * `updated_at` (Optional): A timestamp indicating when the annotation was made, represented as milliseconds since the Unix epoch (integer). If not provided, the SDK automatically sets this to the current UTC time.
      3. `annotation.notes` (Optional): A column containing free-form text notes that apply to the entire span, not a specific annotation label or score. The value should be a string.
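
      The `updated_at` suffix expects integer milliseconds since the Unix epoch. A minimal sketch of that conversion using only the standard library follows; `to_epoch_millis` is a hypothetical helper name, and treating naive datetimes as UTC is an assumption of this example.

      ```python
      from datetime import datetime, timedelta, timezone

      EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


      def to_epoch_millis(moment):
          """Convert a datetime to integer milliseconds since the Unix epoch."""
          if moment.tzinfo is None:
              # Assumption for this sketch: naive datetimes are already in UTC.
              moment = moment.replace(tzinfo=timezone.utc)
          # Integer division of timedeltas avoids float rounding.
          return (moment - EPOCH) // timedelta(milliseconds=1)


      print(to_epoch_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # prints 1704067200000
      ```

      The resulting integers can be placed in an `annotation.NAME.updated_at` column.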

      An example annotation data dictionary would look like:

      ```python theme={null}
      import pandas as pd

      # Assume TARGET_SPAN_ID holds the ID of the span you want to annotate
      TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543"

      annotation_data = {
          "context.span_id": [TARGET_SPAN_ID],
          # Annotation 1: Categorical label, let SDK autogenerate updated_by/updated_at
          "annotation.quality.label": ["good"],
          # Annotation 2: Categorical label, manually set updated_by
          "annotation.relevance.label": ["relevant"],
          "annotation.relevance.updated_by": ["human_annotator_1"],
          # Annotation 3: Numerical score, let SDK autogenerate updated_by/updated_at
          "annotation.sentiment_score.score": [4.5],
          # Optional notes for the span
          "annotation.notes": ["User confirmed the summary was helpful."],
      }
      annotations_dataframe = pd.DataFrame(annotation_data)
      ```
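
      Before logging, the column rules above could be checked locally with a small helper. This is a sketch and not part of the Arize SDK: `check_annotation_columns` is a hypothetical name, it inspects column names only (not values), and flagging any unrecognized column as a problem is this example's own choice; the SDK's `validate=True` option may apply different rules.

      ```python
      import re

      # Pattern from the schema description: annotation.NAME.SUFFIX, where NAME
      # uses only letters, numbers, and underscores.
      ANNOTATION_COLUMN = re.compile(
          r"annotation\.(?P<name>[A-Za-z0-9_]+)\.(?P<suffix>label|score|updated_by|updated_at)"
      )


      def check_annotation_columns(columns):
          """Return a list of schema problems; an empty list means none were found."""
          columns = [str(column) for column in columns]
          problems = []
          if "context.span_id" not in columns:
              problems.append("missing required column 'context.span_id'")

          suffixes_by_name = {}
          for column in columns:
              if column in ("context.span_id", "annotation.notes"):
                  continue
              match = ANNOTATION_COLUMN.fullmatch(column)
              if match is None:
                  problems.append(f"unrecognized column '{column}'")
                  continue
              suffixes_by_name.setdefault(match["name"], set()).add(match["suffix"])

          # Each annotation needs at least one label or score column.
          for name, suffixes in suffixes_by_name.items():
              if not suffixes & {"label", "score"}:
                  problems.append(f"annotation '{name}' has no label or score column")
          return problems


      # With a DataFrame, pass annotations_dataframe.columns instead of a list.
      columns = ["context.span_id", "annotation.quality.label", "annotation.notes"]
      print(check_annotation_columns(columns))  # prints []
      ```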
    </Accordion>
  </Tab>
</Tabs>

For routed review workflows and curating labeled examples into a benchmark dataset, see [Labeling Queues](/ax/evaluate/labeling-queues).

## What's next

To automate quality checks, [create evaluators](/ax/evaluate/create-evaluators). If you'd prefer additional human review at scale, see [create a labeling queue](/ax/evaluate/labeling-queues).

## Further reading

* [Hamel Husain: Why is "error analysis" so important in LLM evals?](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed)
