The three levels
| Level | Scope | The shape of question it answers |
|---|---|---|
| Span | One unit of work — a single LLM call, tool invocation, or retrieval. | Was this thing done correctly? |
| Trace | The full tree of spans for one request, root to leaf. | Was the sequence of work the right one? |
| Session | A collection of traces sharing a session.id — typically a multi-turn conversation. | Did the conversation go well overall? |
Span evaluators
Span evaluators ask questions about a single unit of work. They are the most common kind and the easiest to reason about: the evaluator sees one span, has access to that span’s attributes, and emits a score.
Typical questions a span evaluator answers:
- Was the right tool selected for this call?
- Were the tool parameters extracted correctly?
- Was this individual LLM response factually correct?
- Was this retrieval relevant to the query that triggered it?
- Did this guardrail decision make sense?
Span evaluators read span attributes such as attributes.llm.input_messages, attributes.llm.output_messages, attributes.tool.name, attributes.retrieval.documents, and so on. See Semantic conventions for the full attribute namespace.
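As a minimal sketch of the span-level pattern (one span in, one score out), the following checks tool selection. The nested-dict span shape, the `get_attr` helper, and the `tool_selection_evaluator` function are hypothetical names used only for illustration; a real span export may be flattened or shaped differently.

```python
from typing import Any


def get_attr(span: dict[str, Any], dotted_path: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'attributes.tool.name' through nested dicts."""
    node: Any = span
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def tool_selection_evaluator(span: dict[str, Any], expected_tool: str) -> dict[str, Any]:
    """Score a single span: 1.0 if the expected tool was called, else 0.0."""
    actual = get_attr(span, "attributes.tool.name")
    correct = actual == expected_tool
    return {
        "label": "correct" if correct else "incorrect",
        "score": 1.0 if correct else 0.0,
        "explanation": f"expected {expected_tool!r}, span recorded {actual!r}",
    }


# Hypothetical span payload (assumed shape, not a documented export format).
span = {"attributes": {"tool": {"name": "search_flights"}}}
result = tool_selection_evaluator(span, expected_tool="search_flights")
print(result["label"], result["score"])
```

The evaluator never looks outside the span it was given, which is what keeps span-level checks cheap and easy to reason about.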
Trace evaluators
Trace evaluators ask questions about the shape of a request: how the application got from the user’s input to the final output. They see every span in the trace and judge the whole path.
Typical questions a trace evaluator answers:
- Was the order of tool calls correct?
- Was the agent’s trajectory efficient, or did it loop?
- Did the application skip a step it should have taken?
- Was the chain of LLM calls, retrievals, and tool calls coherent?
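A trace-level check needs all spans of one request at once. The sketch below flags a looping trajectory by looking for the same tool being called several times in a row. The flat span dicts with `start_time` and `tool_name` keys, the `max_repeats` threshold, and both function names are assumptions made for this example, not a documented schema.

```python
from typing import Any


def tool_call_sequence(spans: list[dict[str, Any]]) -> list[str]:
    """Return tool names in start-time order, skipping non-tool spans."""
    ordered = sorted(spans, key=lambda s: s["start_time"])
    return [s["tool_name"] for s in ordered if s.get("tool_name")]


def trajectory_evaluator(spans: list[dict[str, Any]], max_repeats: int = 2) -> dict[str, Any]:
    """Label a trace 'looped' when one tool runs more than max_repeats times in a row."""
    sequence = tool_call_sequence(spans)
    longest_run = 0
    run = 0
    previous = None
    for name in sequence:
        run = run + 1 if name == previous else 1
        longest_run = max(longest_run, run)
        previous = name
    looped = longest_run > max_repeats
    return {
        "label": "looped" if looped else "efficient",
        "score": 0.0 if looped else 1.0,
        "sequence": sequence,
    }


# Hypothetical trace: one LLM span followed by three identical searches and a booking.
trace_spans = [
    {"start_time": 0, "tool_name": None},
    {"start_time": 3, "tool_name": "search"},
    {"start_time": 1, "tool_name": "search"},
    {"start_time": 2, "tool_name": "search"},
    {"start_time": 4, "tool_name": "book"},
]
verdict = trajectory_evaluator(trace_spans)
print(verdict["label"], verdict["sequence"])
```

A heuristic like this can run before, or instead of, an LLM judge; questions such as whether a step was skipped usually need the judge to see the full ordered sequence.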
Session evaluators
Session evaluators look at a whole conversation. They see every trace that shares a session.id (typically every turn of a multi-turn chat) and judge the conversation as a unit.
Typical questions a session evaluator answers:
- Is the conversation coherent across turns?
- Did the assistant maintain a consistent tone?
- Did the conversation reach resolution, or did it stall?
- Did the user appear frustrated by the end?
Session evaluators rely on every trace in the conversation carrying the same session.id. See OpenInference context managers for how that gets set.
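Before a session can be judged, its traces have to be collected and put in turn order. The sketch below groups traces by their session.id and builds a transcript that a judge could read. The flat trace dicts with `session.id`, `start_time`, `input`, and `output` keys are assumed for illustration, as are the function names.

```python
from collections import defaultdict
from typing import Any


def group_by_session(traces: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Bucket traces by session.id and sort each bucket into turn order."""
    sessions: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        session_id = trace.get("session.id")
        if session_id is not None:  # traces without a session.id cannot be grouped
            sessions[session_id].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda t: t["start_time"])
    return dict(sessions)


def build_transcript(turns: list[dict[str, Any]]) -> str:
    """Render one session as alternating user/assistant lines for a judge prompt."""
    lines = []
    for turn in turns:
        lines.append(f"user: {turn['input']}")
        lines.append(f"assistant: {turn['output']}")
    return "\n".join(lines)


# Hypothetical traces; the last one has no session.id and is left out.
traces = [
    {"session.id": "s1", "start_time": 2, "input": "where is my order?", "output": "It ships tomorrow."},
    {"session.id": "s2", "start_time": 1, "input": "cancel my plan", "output": "Done."},
    {"session.id": "s1", "start_time": 1, "input": "hi", "output": "Hello! How can I help?"},
    {"start_time": 3, "input": "ping", "output": "pong"},
]
sessions = group_by_session(traces)
transcript = build_transcript(sessions["s1"])
print(transcript)
```

The transcript grows with every turn, which is why session-level judging is the most expensive of the three levels.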
Picking the level
A simple decision tree:
- Can you answer the question by looking at one span in isolation? → Span
- Do you need to compare spans within the same request? → Trace
- Do you need to look across multiple requests in the same conversation? → Session
The level is selected with the --data-granularity flag (values: span, trace, session). All three are first-class options.
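The decision tree can be written down as a small function, which is sometimes handy for documenting why an evaluator sits at a given level. The `Level` enum and `pick_level` are hypothetical helpers; only the three values (span, trace, session) come from the text above.

```python
from enum import Enum


class Level(str, Enum):
    SPAN = "span"
    TRACE = "trace"
    SESSION = "session"


def pick_level(compares_spans_in_request: bool, spans_multiple_requests: bool) -> Level:
    """Return the narrowest level that can answer the question."""
    if spans_multiple_requests:
        # The question looks across requests in one conversation.
        return Level.SESSION
    if compares_spans_in_request:
        # The question compares spans inside a single request.
        return Level.TRACE
    # One span in isolation is enough.
    return Level.SPAN


# Example questions mapped to levels.
tool_params = pick_level(compares_spans_in_request=False, spans_multiple_requests=False)
tool_order = pick_level(compares_spans_in_request=True, spans_multiple_requests=False)
tone = pick_level(compares_spans_in_request=False, spans_multiple_requests=True)
print(tool_params.value, tool_order.value, tone.value)
```

Checking the widest condition first still yields the narrowest sufficient level, because a question that does not need multiple requests or multiple spans falls through to span.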
A few subtleties to watch for:
- Scope creep at design time. It’s tempting to use a session evaluator “to be safe” when a span evaluator would do. Wider scope means more tokens passed to the judge, higher cost, and lower signal because the judge has to pick the needle out of more hay. Pick the narrowest level that can answer the question.
- Mixing levels for one application. Most production applications need evaluators at multiple levels — tool-correctness at the span level, agent-trajectory at the trace level, conversation-tone at the session level. They run in parallel and don’t interfere with each other.
- Cost and latency scale with level. A span evaluator might see a few hundred tokens of context. A session evaluator might see tens of thousands. This matters when you’re picking the judge model and the sampling rate.
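To make the cost point concrete, a back-of-the-envelope estimate of judge input tokens per day can help when choosing a sampling rate. All volumes below are illustrative placeholders, not measurements; the per-item token counts only echo the rough orders of magnitude mentioned above (hundreds for a span, tens of thousands for a session).

```python
def daily_judge_tokens(items_per_day: int, tokens_per_item: int, sampling_rate: float) -> int:
    """Estimate judge input tokens per day for one evaluator."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    return round(items_per_day * sampling_rate * tokens_per_item)


# Placeholder volumes, chosen only to show how the levels compare.
span_tokens = daily_judge_tokens(items_per_day=10_000, tokens_per_item=300, sampling_rate=1.0)
session_tokens = daily_judge_tokens(items_per_day=500, tokens_per_item=30_000, sampling_rate=0.2)
print(span_tokens, session_tokens)
```

Under these assumed numbers, evaluating every span costs about the same as evaluating one session in five, which is the kind of trade-off the sampling rate is meant to manage.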