Log Evals to Traces

You may have evaluators that run on large datasets or use additional external data sources. To help manage resources and control costs, Arize gives you the flexibility to decide when and how your evals are run and tracked. With these self-managed evals, you stay in control of execution, data, and evaluator configuration.

1. Import Spans in Code

First, export your traces from Arize. In the LLM Tracing tab, click the export button and choose Export to Notebook to get boilerplate code you can copy into your evaluator.

# This will be prefilled by the export command.

import os
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

os.environ['ARIZE_API_KEY'] = 'YOUR_API_KEY'

client = ArizeExportClient()

print('#### Exporting your primary dataset into a dataframe.')

primary_df = client.export_model_to_df(
    space_id='your-space-id', # this will be prefilled by export
    model_id='your-model-id', # this will be prefilled by export
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat(''), # this will be prefilled by export
    end_time=datetime.fromisoformat(''),   # this will be prefilled by export
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)

2. Run an Eval

We will walk through a sample LLM-as-a-Judge eval. First, define an evaluation template:

MY_SAMPLE_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Response]: {output}
    [END DATA]

    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative".
    '''
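For each row in your dataframe, the classifier fills the template's {input} and {output} placeholders with that row's columns, which is ordinary string formatting. A minimal sketch with an abbreviated template and a made-up row:

```python
# Abbreviated template for illustration; the full template above is filled the same way.
template = "[Question]: {input}\n[Response]: {output}"

# Hypothetical row values; in practice these come from your traces dataframe.
row = {"input": "How was the demo?", "output": "It went great, everyone loved it!"}

prompt = template.format(**row)
print(prompt)
```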

Check which attributes are present in your traces dataframe:

primary_df.columns

If you're using OpenAI traces, set the input/output variables like this:

primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]

Then create the evaluator and run it over the dataframe:

from phoenix.evals import create_classifier
from phoenix.evals.evaluators import async_evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-5")

sample_evaluator = create_classifier(
    name="sample-eval",
    llm=llm,
    prompt_template=MY_SAMPLE_TEMPLATE,
    choices={"correct": 1.0, "incorrect": 0.0},
)

results_df = await async_evaluate_dataframe(
    dataframe=primary_df,
    evaluators=[sample_evaluator],
)

3. Log Evals

Use the log_evaluations_sync function in our Python SDK to attach evaluations you've run to data in the UI. This function requires four columns:

  • eval.<eval_name>.label

  • eval.<eval_name>.score

  • eval.<eval_name>.explanation

  • context.span_id
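As an illustration, a dataframe with these four columns might look like the following sketch. The span IDs and values here are made up; in practice the span IDs come from your exported traces, and the eval name matches your evaluator's name:

```python
import pandas as pd

# Hypothetical hand-built evaluation dataframe with the four required columns.
sample_eval_df = pd.DataFrame({
    "context.span_id": ["span-abc-123", "span-def-456"],
    "eval.sample-eval.label": ["positive", "negative"],
    "eval.sample-eval.score": [1.0, 0.0],
    "eval.sample-eval.explanation": [
        "The response has an upbeat, friendly tone.",
        "The response is dismissive and curt.",
    ],
})

print(sample_eval_df.columns.tolist())
```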

The code below assumes that you have already completed an evaluation run. We use the to_annotation_dataframe utility to format our results.

from arize.pandas.logger import Client
from phoenix.evals.utils import to_annotation_dataframe

client = Client()
sample_eval_df = to_annotation_dataframe(results_df)

sample_eval_df = sample_eval_df.rename(columns={
    "label": "eval.sample-eval.label",
    "score": "eval.sample-eval.score",
    "explanation": "eval.sample-eval.explanation"
})

client.log_evaluations_sync(sample_eval_df, 'your-project-name')
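If some rows failed to evaluate, you may want to drop rows without a span ID before logging, since only rows with a valid context.span_id can be attached to a trace. A minimal sketch using a hypothetical dataframe:

```python
import pandas as pd

# Hypothetical cleanup step: drop rows whose span ID is missing so the
# logger only receives rows it can attach to a trace.
df = pd.DataFrame({
    "context.span_id": ["span-abc-123", None],
    "eval.sample-eval.label": ["positive", None],
})
clean_df = df.dropna(subset=["context.span_id"]).reset_index(drop=True)
print(len(clean_df))
```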
