The first design decision when building an evaluator is the level it runs at. Span, trace, and session evaluators look at different slices of data — and answer different shapes of question. Picking the wrong level means the evaluator either can’t see enough data to judge, or sees more than it needs and loses focus. This page covers the conceptual distinction. For the language of spans, traces, and sessions themselves, see Signals, spans, traces, and sessions in the OpenTelemetry concepts section.

The three levels

Level	Scope	The shape of question it answers
Span	One unit of work — a single LLM call, tool invocation, or retrieval.	Was this thing done correctly?
Trace	The full tree of spans for one request, root to leaf.	Was the sequence of work the right one?
Session	A collection of traces sharing a `session.id` — typically a multi-turn conversation.	Did the conversation go well overall?

The level you pick is determined by where the data you need to judge actually lives. If everything you need to answer the question is in one span, use a span evaluator. If you need to look across multiple spans within a single request, use a trace evaluator. If you need to look across multiple requests in the same conversation, use a session evaluator.

Span evaluators

Span evaluators ask questions about a single unit of work. They are the most common kind and the easiest to reason about — the evaluator sees one span, has access to that span’s attributes, and emits a score. Typical questions a span evaluator answers:

Was the right tool selected for this call?
Were the tool parameters extracted correctly?
Was this individual LLM response factually correct?
Was this retrieval relevant to the query that triggered it?
Did this guardrail decision make sense?

The data they read lives entirely under one span’s attributes — attributes.llm.input_messages, attributes.llm.output_messages, attributes.tool.name, attributes.retrieval.documents, and so on. See Semantic conventions for the full attribute namespace.

Trace evaluators

Trace evaluators ask questions about the shape of a request — how the application got from the user’s input to the final output. They see every span in the trace and judge the whole path. Typical questions a trace evaluator answers:

Was the order of tool calls correct?
Was the agent’s trajectory efficient, or did it loop?
Did the application skip a step it should have taken?
Was the chain of LLM calls + retrievals + tool calls coherent?

Trace evaluators are what you reach for when individual spans look fine but the way they fit together doesn’t. A span-level “tool call correctness” eval might pass on every tool call individually while the agent calls tools in a nonsensical order. Only a trace evaluator catches that.

Session evaluators

Session evaluators look at a whole conversation. They see every trace that shares a session.id — typically every turn of a multi-turn chat — and judge the conversation as a unit. Typical questions a session evaluator answers:

Is the conversation coherent across turns?
Did the assistant maintain a consistent tone?
Did the conversation reach resolution, or did it stall?
Did the user appear frustrated by the end?

Session-level questions are the ones individual traces can’t answer because they require the history. A single turn of a chat might be perfectly fine on its own and still be unhelpful in the context of the previous five turns. Sessions only exist when your application is instrumented to set session.id. See OpenInference context managers for how that gets set.

Picking the level

A simple decision tree:

Can you answer the question by looking at one span in isolation? → Span
Do you need to compare spans within the same request? → Trace
Do you need to look across multiple requests in the same conversation? → Session

The level isn’t a property of the evaluator template — it’s a property of the evaluator task. The underlying knob is the --data-granularity flag (values: span, trace, session). All three are first-class options. A few subtleties to watch for:

Scope creep at design time. It’s tempting to use a session evaluator “to be safe” when a span evaluator would do. Wider scope means more tokens passed to the judge, higher cost, and lower signal because the judge has to pick the needle out of more hay. Pick the narrowest level that can answer the question.
Mixing levels for one application. Most production applications need evaluators at multiple levels — tool-correctness at the span level, agent-trajectory at the trace level, conversation-tone at the session level. They run in parallel and don’t interfere with each other.
Cost and latency scale with level. A span evaluator might see a few hundred tokens of context. A session evaluator might see tens of thousands. This matters when you’re picking the judge model and the sampling rate.

Next step

Levels tell you where to look. The next decision is what kind of evaluator does the looking:

OpenTelemetry and OpenInference

Prompts

Evaluators

adb

Evaluation Levels — Span, Trace, and Session

The three levels

Span evaluators

Trace evaluators

Session evaluators

Picking the level

Next step

Next: Evaluator Types

​The three levels

​Span evaluators

​Trace evaluators

​Session evaluators

​Picking the level

​Next step

Next: Evaluator Types

The three levels

Span evaluators

Trace evaluators

Session evaluators

Picking the level

Next step