Code Evaluator Output Shapes

This page documents every return shape a code evaluator can produce and how Phoenix maps each one to an EvaluationResult with label, score, and explanation fields.

This page covers the server-side code evaluators that run in the Phoenix UI (Sandbox evaluators). For client-side create_evaluator / createEvaluator SDK evaluators, see Code Evaluators.

The Triple-Collapse Model

Every return value from a code evaluator is normalized to a triple: (label, score, explanation). Phoenix applies this in two stages:

Stage 1 — Extract: The raw return value is mapped to a Triple based on its shape (bare scalar or dict-by-key).
Stage 2 — Validate: The triple is checked against the evaluator’s output config (categorical, continuous, or none).

Any value that cannot be cleanly mapped raises a ValueError whose message enumerates all accepted shapes for the configured output type.
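The two stages can be pictured with a minimal sketch. This is an illustration under stated assumptions, not Phoenix's actual implementation: `Triple`, `extract`, `normalize`, and `no_checks` are hypothetical names, and only three shapes (string, number, dict) are handled here. How `bool`, `None`, and other shapes are treated depends on the output config, as described in the sections that follow.

```python
from typing import Any, Callable, NamedTuple, Optional


class Triple(NamedTuple):
    """Hypothetical container for the (label, score, explanation) triple."""
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: Optional[str] = None


def extract(value: Any) -> Triple:
    """Stage 1: map a raw return value to a Triple based on its shape."""
    if isinstance(value, dict):
        # Dict-by-key: keys that are absent become None.
        return Triple(value.get("label"), value.get("score"), value.get("explanation"))
    if isinstance(value, str):
        return Triple(label=value)
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return Triple(score=value)
    raise ValueError(f"Cannot map a {type(value).__name__} to a triple in this sketch.")


def normalize(value: Any, validate: Callable[[Triple], Triple]) -> Triple:
    """Run Stage 1, then hand the triple to a config-specific Stage 2 validator."""
    return validate(extract(value))


def no_checks(triple: Triple) -> Triple:
    """Placeholder Stage 2 validator that accepts every triple unchanged."""
    return triple
```

Passing the validator in as a function keeps the extract step independent of the output config, which mirrors the separation between the two stages.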

Accepted Shapes by Output Config

Categorical Output Config

A categorical config defines a fixed set of {label, score} pairs. The evaluator must return one of the configured labels; Phoenix looks up the associated score automatically.

Bare string (recommended):

Python

return "pass"

TypeScript

return "pass";

Dict with label and optional explanation:

Python

return {"label": "pass", "explanation": "The output matched the expected format."}

TypeScript

return {"label": "pass", "explanation": "The output matched the expected format."};

Notes:

The label must exactly match one of the configured values; unrecognized labels raise ValueError.
Including a score key in the dict that conflicts with the config’s lookup value raises ValueError.
Free-form explanation strings are always accepted and passed through to EvaluationResult.explanation.
Tuple shorthand (return ("pass", 1.0)) is not accepted; use the dict form if you need to supply additional fields.
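The rules in these notes can be sketched as a small validator. This is a hedged illustration, not the Phoenix source: `validate_categorical` and `scores_by_label` are hypothetical names, and the config is modeled as a plain dict from label to score.

```python
from typing import Any, Dict, Optional, Tuple


def validate_categorical(
    value: Any, scores_by_label: Dict[str, float]
) -> Tuple[str, float, Optional[str]]:
    """Check a return value against a categorical config and look up its score."""
    if isinstance(value, str):
        label, score, explanation = value, None, None
    elif isinstance(value, dict):
        label = value.get("label")
        score = value.get("score")
        explanation = value.get("explanation")
    else:
        # Tuples, lists, numbers, and other shapes are not accepted here.
        raise ValueError(f"Unsupported shape for categorical output: {type(value).__name__}")

    if not isinstance(label, str) or label not in scores_by_label:
        raise ValueError(
            f"Label {label!r} not in categorical output config values {list(scores_by_label)}."
        )

    expected = scores_by_label[label]
    if score is not None and score != expected:
        # A supplied score must agree with the score configured for the label.
        raise ValueError(f"Score {score!r} conflicts with configured score {expected!r}.")
    return (label, expected, explanation)
```

For example, with `{"pass": 1.0, "fail": 0.0}` as the assumed config, returning `"pass"` yields a score of `1.0` without the evaluator supplying it.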

Continuous Output Config

A continuous config validates that the returned value is a finite number within optional lower_bound / upper_bound bounds. Labels are optional and free-form.

Bare number (recommended):

Python

return 0.85

TypeScript

return 0.85;

Dict with score and optional explanation:

Python

# score in range 0.0 - 1.0
return {"score": 0.85, "explanation": "High confidence based on keyword match."}

TypeScript

// score in range 0.0 - 1.0
return {"score": 0.85, "explanation": "High confidence based on keyword match."};

Notes:

bool values are not treated as numeric and raise ValueError.
NaN and Infinity are rejected.
Free-form string labels are allowed in the dict form alongside a numeric score.
Tuple shorthand is not accepted.
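A minimal sketch of these checks follows. It is an assumption-laden illustration rather than Phoenix's code: `validate_continuous` is a hypothetical name, and the bounds are treated as inclusive, which the text above does not specify.

```python
import math
from typing import Any, Optional, Tuple


def validate_continuous(
    value: Any,
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> Tuple[Optional[str], float, Optional[str]]:
    """Check a return value against a continuous config."""
    if isinstance(value, dict):
        label = value.get("label")
        score = value.get("score")
        explanation = value.get("explanation")
    else:
        label, score, explanation = None, value, None

    # bool is a subclass of int in Python, so it must be rejected explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"Score must be a number, got {type(score).__name__}.")
    if not math.isfinite(score):
        raise ValueError("Score must be finite; NaN and Infinity are rejected.")
    # Assumption: both bounds are inclusive.
    if lower_bound is not None and score < lower_bound:
        raise ValueError(f"Score {score} is below lower_bound {lower_bound}.")
    if upper_bound is not None and score > upper_bound:
        raise ValueError(f"Score {score} is above upper_bound {upper_bound}.")
    return (label, float(score), explanation)
```

Because a tuple is neither a dict nor a number, the tuple shorthand falls through to the first `ValueError` in this sketch.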

No Output Config

When no output config is specified, Phoenix applies a permissive bare passthrough:

| Return value | Result |
| --- | --- |
| `str` | `label=<value>` |
| `int` or `float` | `score=<value>` |
| `bool` | `label="True"` or `label="False"` (not numeric) |
| `None` | `(label=None, score=None)` |
| `{"label": ..., "score": ..., "explanation": ...}` | triple by key |

Lists and arbitrary nested objects are rejected. They were previously stringified into labels without warning, which masked misconfiguration. Return one of the recognized shapes instead.
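The passthrough mapping can be sketched as follows. The function name `passthrough` is hypothetical, the explanation is assumed to be `None` for bare values, and the values inside a dict are not type-checked in this sketch.

```python
from typing import Any, Optional, Tuple


def passthrough(value: Any) -> Tuple[Optional[str], Optional[float], Optional[str]]:
    """Map a return value to (label, score, explanation) when no output config is set."""
    if value is None:
        return (None, None, None)
    # The bool check must come before the int check: bool is a subclass of int.
    if isinstance(value, bool):
        return (str(value), None, None)
    if isinstance(value, str):
        return (value, None, None)
    if isinstance(value, (int, float)):
        return (None, value, None)
    if isinstance(value, dict):
        # Triple by key; absent keys become None.
        return (value.get("label"), value.get("score"), value.get("explanation"))
    # Lists and other objects are rejected instead of being stringified.
    raise ValueError(f"Unsupported return shape: {type(value).__name__}")
```

The order of the `isinstance` checks matters in Python: testing for `int` first would turn `True` into a score of `1` rather than the label `"True"`.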

The `explanation` Field

Any accepted shape may include an explanation string. Phoenix passes it through to EvaluationResult.explanation unchanged:

Python

return {"label": "fail", "explanation": "Response contained prohibited content."}

TypeScript

return {"label": "fail", "explanation": "Response contained prohibited content."};

The explanation appears in the Phoenix UI alongside the label and score and is available in the evaluation results API.

Multi-Output Evaluators

When an evaluator has multiple output configs (e.g., one for toxicity and one for safety), Phoenix supports two routing modes:

Shared value (default)

Return a single value — Phoenix applies the same return value to each output config independently:

Python

return "pass"  # applied to every output config

TypeScript

return "pass";  // applied to every output config

Per-config routing dict

Return a dict whose keys match every output config name. Phoenix routes each value to the corresponding config:

Python

return {
    "toxicity": 0.1,
    "safety": "pass",
    "explanation": "Content appears safe.",  # shared fallback
}

TypeScript

return {
    "toxicity": 0.1,
    "safety": "pass",
    "explanation": "Content appears safe.",  // shared fallback
};

Routing rules:

The dict must contain a key for every output config name; a partial match is treated as a shared value, not a routing dict.
A top-level "explanation" key acts as a shared fallback: if a per-config sub-value omits explanation, the top-level value fills it in.
Per-config sub-values may themselves be dicts with their own "explanation" key — per-config explanation takes precedence over the shared fallback.
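The routing rules above can be sketched in a few lines. This is an illustrative assumption about how the decision could be made, not Phoenix's implementation; `route` and `_own_explanation` are hypothetical helpers, and each config receives a `(sub_value, explanation)` pair.

```python
from typing import Any, Dict, List, Optional, Tuple


def _own_explanation(value: Any) -> Optional[str]:
    """Return the explanation carried by a dict value, if any."""
    return value.get("explanation") if isinstance(value, dict) else None


def route(value: Any, config_names: List[str]) -> Dict[str, Tuple[Any, Optional[str]]]:
    """Return {config_name: (sub_value, explanation)} for every output config."""
    is_routing_dict = isinstance(value, dict) and all(name in value for name in config_names)
    if not is_routing_dict:
        # Shared value (including a partial key match): every config sees the same value.
        return {name: (value, _own_explanation(value)) for name in config_names}

    shared = value.get("explanation")  # top-level fallback explanation
    routed = {}
    for name in config_names:
        sub_value = value[name]
        own = _own_explanation(sub_value)
        # A per-config explanation takes precedence over the shared fallback.
        routed[name] = (sub_value, own if own is not None else shared)
    return routed
```

With configs named `toxicity` and `safety`, a dict that contains only `toxicity` is not a routing dict under this sketch, so the whole dict is passed to both configs as a shared value.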

Per-config explanation example:

Python

return {
    "toxicity": {"score": 0.9, "explanation": "Contains slurs."},
    "safety": "fail",
    "explanation": "Overall content is unsafe.",  # only used for safety
}

TypeScript

return {
    "toxicity": {"score": 0.9, "explanation": "Contains slurs."},
    "safety": "fail",
    "explanation": "Overall content is unsafe.",  // only used for safety
};

Multi-output naming convention

Each output config produces a separate EvaluationResult named {evaluator_name}.{config_name}. For example, an evaluator named content-check with configs toxicity and safety produces two results: content-check.toxicity and content-check.safety.
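As a small illustration of the naming pattern, the result names could be derived like this. The helper `result_names` is hypothetical and exists only to show the `{evaluator_name}.{config_name}` format.

```python
from typing import List


def result_names(evaluator_name: str, config_names: List[str]) -> List[str]:
    """Build one result name per output config, joined with a dot."""
    return [f"{evaluator_name}.{config_name}" for config_name in config_names]


names = result_names("content-check", ["toxicity", "safety"])
print(names)
```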

Error Messages

When a return value does not match the accepted shapes, the ValueError message enumerates all valid shapes for the configured output type in the evaluator’s language. For example, a categorical config with values ["pass", "fail"] in Python would produce:

Label 'unknown' not in categorical output config values ['pass', 'fail'].
Valid shapes:
  return "pass"
  return {"label": "pass", "explanation": "..."}

This makes it straightforward to identify and fix mismatches without consulting documentation.
