Span-Level Evals
Span-level evaluation assesses performance at the level of an individual step within a larger system, such as a single LLM call, retrieval action, or tool call.
This level of analysis helps pinpoint where breakdowns occur: a failure may stem from an incorrect retrieval result, for instance, rather than from the final output alone. Attributing errors to the specific step that caused them provides the fine-grained diagnostics needed to improve reliability across complex LLM pipelines.
Span-Level Evaluations via UI
When creating an evaluation task, you can filter for the types of spans you want to evaluate. After defining your filters, select “Span” as the scope when creating your evaluator.
When spans are filtered at the task level, the filter also applies to any trace-level or session-level evaluators defined within the task: those evaluators run on any trace or session that contains at least one matching span.
Span-Level Evaluations via Code
By default, exporting data from an Arize project returns a dataframe containing all spans within the selected timeframe:
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
client = ArizeExportClient()
primary_df = client.export_model_to_df(
space_id='', # this will be prefilled by export
model_id='', # this will be prefilled by export
environment=Environments.TRACING,
start_time=datetime.fromisoformat(''), # this will be prefilled by export
end_time=datetime.fromisoformat(''), # this will be prefilled by export
)This export returns all spans from your project. To perform a span-level evaluation, filter the dataframe for the span types you want to analyze. For example, if you want to evaluate LLM spans:
For example, to evaluate only LLM spans:
filtered_spans = primary_df[primary_df["attributes.openinference.span.kind"] == "LLM"]
From here, you can define your evaluator and run it on your spans.
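As a sketch of that last step, here is a simple heuristic evaluator applied row by row with pandas, standing in for whatever evaluator you define. The REFUSAL_MARKERS tuple, the contains_refusal helper, and the eval.refusal.label column are hypothetical names used for illustration, and the attributes.output.value column is an assumption based on OpenInference conventions; verify the column names against your own export:
# Hypothetical heuristic: label each span's output as "refusal" or "ok".
REFUSAL_MARKERS = ("i'm sorry", "i can't", "as an ai")

def contains_refusal(output: str) -> str:
    text = (output or "").lower()
    return "refusal" if any(marker in text for marker in REFUSAL_MARKERS) else "ok"

# "attributes.output.value" is assumed here; check your export's columns.
filtered_spans = filtered_spans.copy()
filtered_spans["eval.refusal.label"] = (
    filtered_spans["attributes.output.value"].astype(str).map(contains_refusal)
)
In practice, you would typically swap this heuristic for an LLM-as-a-judge evaluator and log the resulting labels back to Arize alongside the spans.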