Span-Level Evals
Span-level evaluation focuses on assessing performance at the level of an individual step within a larger system, such as a single LLM call, retrieval action, or tool call.
This level of analysis helps identify where breakdowns occur. For instance, you can trace an error to an incorrect retrieval result instead of attributing the failure only to the final output. This provides the fine-grained diagnostics needed to improve reliability across complex LLM pipelines.
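To make the idea concrete, here is a minimal sketch in plain Python. It does not use the platform's API: the `Span` class, the two evaluator functions, and the sample trace are all hypothetical, chosen only to show how checking each step separately points to the step that broke.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for one step in a trace."""
    span_id: str
    kind: str    # e.g. "retriever", "llm", "tool"
    output: str


def evaluate_retrieval(span: Span, expected_doc: str) -> bool:
    # Pass if the expected document id appears in the retriever output.
    return expected_doc in span.output


def evaluate_llm(span: Span, required_phrase: str) -> bool:
    # Pass if the LLM output contains the required phrase (case-insensitive).
    return required_phrase.lower() in span.output.lower()


def first_failing_span(
    spans: List[Span], expected_doc: str, required_phrase: str
) -> Optional[str]:
    """Return the id of the first span that fails its check, or None."""
    for span in spans:
        if span.kind == "retriever":
            passed = evaluate_retrieval(span, expected_doc)
        elif span.kind == "llm":
            passed = evaluate_llm(span, required_phrase)
        else:
            continue  # no evaluator defined for this span kind in the sketch
        if not passed:
            return span.span_id
    return None


# Illustrative trace: one retrieval step followed by one LLM call.
trace = [
    Span("s1", "retriever", "doc_17, doc_42"),
    Span("s2", "llm", "The refund window is 14 days."),
]
```

Calling `first_failing_span(trace, "doc_99", "14 days")` reports the retriever span, even though the LLM span would pass its own check. An evaluation that looked only at the final output would not show which step was responsible.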
Span-Level Evaluations via UI
When creating an evaluation task, you can filter for the types of spans you want to evaluate. After defining your filters, select “Span” as the scope when creating your evaluator.
When spans are filtered at the task level, the same filter also applies to any trace-level or session-level evaluators defined within the task. For those evaluators, the filter matches any trace or session that contains matching spans.
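The filter behavior described above can be sketched in plain Python. This is an illustration only, not the product's implementation: the dictionary layout, the field names (`trace_id`, `span_id`, `kind`), and both helper functions are assumptions, as is the reading that a single matching span is enough to select its trace.

```python
from typing import Dict, List

# Hypothetical span records; field names are assumptions for illustration.
spans: List[Dict[str, str]] = [
    {"trace_id": "t1", "span_id": "a", "kind": "retriever"},
    {"trace_id": "t1", "span_id": "b", "kind": "llm"},
    {"trace_id": "t2", "span_id": "c", "kind": "llm"},
    {"trace_id": "t3", "span_id": "d", "kind": "tool"},
]


def span_matches(span: Dict[str, str], filters: Dict[str, str]) -> bool:
    # A span matches when every filter key equals the span's value.
    return all(span.get(key) == value for key, value in filters.items())


def select_span_ids(spans: List[Dict[str, str]], filters: Dict[str, str]) -> List[str]:
    # Span scope: the evaluator sees only the matching spans themselves.
    return [s["span_id"] for s in spans if span_matches(s, filters)]


def select_trace_ids(spans: List[Dict[str, str]], filters: Dict[str, str]) -> List[str]:
    # Trace scope (assumed semantics): a trace is selected if it contains
    # at least one matching span.
    return sorted({s["trace_id"] for s in spans if span_matches(s, filters)})
```

With the filter `{"kind": "llm"}`, a span-scoped evaluator in this sketch would see spans `b` and `c`, while a trace-scoped evaluator in the same task would see traces `t1` and `t2`. Trace `t3` is skipped because none of its spans match.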