Span-Level Evals
Span-level evaluation assesses performance at the level of an individual step within a larger system, such as a single LLM call, retrieval action, or tool call.
This level of analysis helps pinpoint where breakdowns occur: a failure may stem from an incorrect retrieval result, for instance, rather than from the final output alone. Attributing errors to the specific step that caused them provides the fine-grained diagnostics needed to improve reliability across complex LLM pipelines.
Span-Level Evaluations via UI
When creating an evaluation task, you can filter for the types of spans you want to evaluate. After defining your filters, select “Span” as the scope when creating your evaluator.
When spans are filtered at the task level, the filter also applies to any trace-level or session-level evaluators defined within the task: those evaluators run on any trace or session that contains at least one matching span.
Span-Level Evaluations via Code
By default, exporting data from an Arize project returns a dataframe containing all spans within the selected timeframe:
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
client = ArizeExportClient()
primary_df = client.export_model_to_df(
space_id='', # this will be prefilled by export
model_id='', # this will be prefilled by export
environment=Environments.TRACING,
start_time=datetime.fromisoformat(''), # this will be prefilled by export
end_time=datetime.fromisoformat(''), # this will be prefilled by export
)This export returns all spans from your project. To perform a span-level evaluation, filter the dataframe for the span types you want to analyze. For example, if you want to evaluate LLM spans:
For example, to evaluate only LLM spans:
filtered_spans = primary_df[primary_df["attributes.openinference.span.kind"] == "LLM"]
From here, you can define your evaluator and run it on your spans.
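As a sketch of that last step, here is a simple heuristic evaluator applied row by row with pandas, standing in for whatever evaluator you define. The REFUSAL_MARKERS tuple, the contains_refusal helper, and the eval.refusal.label column are hypothetical names used for illustration, and the attributes.output.value column is an assumption based on OpenInference conventions; verify the column names against your own export:
# Hypothetical heuristic: label each span's output as "refusal" or "ok".
REFUSAL_MARKERS = ("i'm sorry", "i can't", "as an ai")

def contains_refusal(output: str) -> str:
    text = (output or "").lower()
    return "refusal" if any(marker in text for marker in REFUSAL_MARKERS) else "ok"

# "attributes.output.value" is assumed here; check your export's columns.
filtered_spans = filtered_spans.copy()
filtered_spans["eval.refusal.label"] = (
    filtered_spans["attributes.output.value"].astype(str).map(contains_refusal)
)
In practice, you would typically swap this heuristic for an LLM-as-a-judge evaluator and log the resulting labels back to Arize alongside the spans.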