Span-Level Evals

Span-level evaluation focuses on assessing performance at the level of an individual step within a larger system, such as a single LLM call, retrieval action, or tool call.

This level of analysis helps pinpoint where a breakdown occurs. For instance, a bad answer may trace back to an incorrect retrieval result rather than to the final LLM call, which is easy to miss when only the final output is judged. This provides the fine-grained diagnostics needed to improve reliability across complex LLM pipelines.
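To make the idea concrete, here is a minimal sketch of span-level evaluation in plain Python. It is illustrative only and does not use any platform SDK: the `Span` dataclass, the `evaluate_spans` helper, the two toy evaluators, and the sample trace are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    """One step in a trace (hypothetical structure, not a real SDK type)."""
    span_id: str
    kind: str    # e.g. "retriever", "llm", "tool"
    input: str
    output: str


def retrieved_text_mentions_query(span: Span) -> bool:
    """Toy retriever check: every query word appears in the retrieved text."""
    text = span.output.lower()
    return all(word in text for word in span.input.lower().split())


def answer_is_nonempty(span: Span) -> bool:
    """Toy LLM check: the model produced some non-whitespace output."""
    return bool(span.output.strip())


def evaluate_spans(
    spans: List[Span],
    evaluators: Dict[str, Callable[[Span], bool]],
) -> Dict[str, bool]:
    """Run the evaluator registered for each span kind; skip other kinds."""
    results: Dict[str, bool] = {}
    for span in spans:
        evaluator = evaluators.get(span.kind)
        if evaluator is not None:
            results[span.span_id] = evaluator(span)
    return results


# Hypothetical trace: the retriever returns an off-topic document,
# and the LLM then answers from it.
trace = [
    Span("s1", "retriever", "refund window", "Shipping takes 3-5 business days."),
    Span("s2", "llm", "What is the refund window?", "Shipping takes 3-5 business days."),
]

results = evaluate_spans(
    trace,
    {"retriever": retrieved_text_mentions_query, "llm": answer_is_nonempty},
)
failing = [span_id for span_id, passed in results.items() if not passed]
print(results)
print(failing)
```

In this sketch the LLM span passes its (deliberately weak) check while the retriever span fails, which locates the problem at the retrieval step instead of at the final output. A real evaluator would typically use an LLM judge or a labeled dataset in place of these string checks.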

Span-Level Evaluations via UI

When creating an evaluation task, you can filter for the types of spans you want to evaluate. After defining your filters, select “Span” as the scope when creating your evaluator.
