How It Works
- Define an evaluator — Create an evaluator on the server: either a built-in heuristic (exact match, regex, JSON distance, etc.) or an LLM-as-a-judge evaluator backed by a Phoenix-managed prompt.
- Attach it to a dataset — Configure which evaluators apply to a given dataset along with input mappings that tell each evaluator where to find its inputs (model output, reference data, metadata).
- Run experiments — When you run an experiment against that dataset, the attached evaluators execute server-side and label or score each output. Results appear as annotations on experiment runs.
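The input-mapping step above can be sketched in plain Python. This is an illustrative model, not the Phoenix API: the run record shape, the dotted-path keys, and both function names are assumptions made for the example.

```python
# Minimal sketch (not the Phoenix API) of how an input mapping might route
# fields from an experiment run into an evaluator's expected arguments.
# The record layout and dotted paths below are illustrative assumptions.

def resolve_path(record: dict, dotted_path: str):
    """Walk a dotted path like 'output.answer' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value

def apply_input_mapping(run: dict, mapping: dict) -> dict:
    """Build evaluator keyword arguments from a run record using the mapping."""
    return {arg: resolve_path(run, path) for arg, path in mapping.items()}

run = {
    "output": {"answer": "Paris"},
    "example": {"reference": "Paris", "metadata": {"topic": "geography"}},
}
mapping = {"output": "output.answer", "expected": "example.reference"}
print(apply_input_mapping(run, mapping))  # → {'output': 'Paris', 'expected': 'Paris'}
```

The same evaluator can then be reused across datasets with different schemas; only the mapping changes.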
Evaluator Types
Built-in Evaluators
Deterministic, code-based evaluators that run without an LLM. These include exact match, contains, regex, Levenshtein distance, and JSON distance.
LLM Evaluators
LLM-as-a-judge evaluators backed by Phoenix-managed prompts. Use prompt versioning and tags to iterate on evaluation criteria with full traceability.
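The judge pattern can be sketched as: render a versioned prompt template with the run's fields, call a model, and parse its reply into a label and score. The template text, label set, and `call_model` stub below are assumptions for illustration; in Phoenix the prompt lives server-side under version control.

```python
# Sketch of LLM-as-a-judge evaluation against a versioned prompt template.
# Template wording, labels, and the call_model stub are illustrative only.

JUDGE_PROMPT_V1 = (
    "You are grading an answer.\n"
    "Question: {question}\nAnswer: {answer}\nReference: {expected}\n"
    "Reply with exactly one word: correct or incorrect."
)

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "correct"

def judge(question: str, answer: str, expected: str,
          template: str = JUDGE_PROMPT_V1) -> dict:
    prompt = template.format(question=question, answer=answer, expected=expected)
    raw = call_model(prompt).strip().lower()
    label = raw if raw in {"correct", "incorrect"} else "unparseable"
    return {"label": label, "score": 1.0 if label == "correct" else 0.0}

print(judge("Capital of France?", "Paris", "Paris"))
```

Pinning the template to a version tag is what lets you change the grading criteria later while keeping old experiment results attributable to the prompt that produced them.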
Why Use Server Evals
- Consistent evaluation — Evaluators are defined once and applied uniformly. Every experiment against a dataset uses the same criteria, eliminating drift between ad-hoc evaluation scripts.
- No local setup required — Built-in evaluators run entirely on the server with no SDK installation, API keys, or dependencies needed. LLM evaluators use the model configuration already set up on your Phoenix instance.
- Tracing — LLM evaluators produce OpenTelemetry traces in a dedicated project, so you can audit, debug, and improve your evaluation prompts the same way you observe any other LLM workflow.