Dataset evaluators are evaluators attached directly to a dataset; they run automatically whenever you execute an experiment from the Phoenix UI. They act as reusable test cases that validate task outputs every time you iterate on a prompt or model.
Dataset evaluators currently run automatically only for experiments executed from the Phoenix UI (e.g., the Playground). For programmatic experiments, pass evaluators explicitly to run_experiment. See Using Evaluators for details.
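For programmatic experiments, a minimal sketch of passing an evaluator explicitly to run_experiment might look like the following. The dataset name, field names (query, answer), and experiment name are hypothetical placeholders, not values defined by Phoenix:

```python
import phoenix as px
from phoenix.experiments import run_experiment

# Hypothetical dataset name; replace with a dataset that exists in your Phoenix instance.
dataset = px.Client().get_dataset(name="qa-dataset")

def task(input):
    # Hypothetical task: call your prompt or model here and return its output.
    return {"response": f"Echo: {input['query']}"}

def contains_reference(output, expected) -> bool:
    # Simple deterministic check: does the output mention the expected answer?
    return str(expected.get("answer", "")).lower() in str(output).lower()

experiment = run_experiment(
    dataset,
    task,
    evaluators=[contains_reference],  # passed explicitly for programmatic runs
    experiment_name="playground-parity-check",
)
```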

Why Use Dataset Evaluators

When iterating on prompts in the Playground, dataset evaluators eliminate the need to manually configure evaluators each time. Attach them once to your dataset, and they run automatically on every experiment.
  • Consistent evaluation: The same criteria applied every time you test
  • Faster iteration: No setup required when running experiments from the UI
  • Built-in tracing: Each evaluator captures traces for debugging and refinement

Creating a Dataset Evaluator

  1. Navigate to your dataset and click the Evaluators tab
  2. Click Add evaluator and choose:
    • LLM evaluator: Use an LLM to judge outputs (e.g., correctness, relevance)
    • Built-in code evaluator: Use deterministic checks (e.g., exact match, regex, contains)
  3. Configure the input mapping to connect evaluator variables to dataset fields
  4. Test with an example, then save
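The UI flow above has a rough programmatic analogue: an LLM evaluator is ultimately a function that prompts a judge model with the example's fields and returns a label or score. A minimal sketch, assuming an OpenAI API key is configured and a hypothetical input.query field in the dataset:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correctness(input, output, expected) -> float:
    """LLM-as-judge sketch: ask a model whether the output answers the question."""
    prompt = (
        "You are grading an answer.\n"
        f"Question: {input.get('query')}\n"   # hypothetical dataset field
        f"Expected: {expected}\n"
        f"Answer: {output}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    label = reply.choices[0].message.content.strip().lower()
    return 1.0 if label == "correct" else 0.0
```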

Input Mapping Reference

Dataset evaluators use the same input mapping concepts as the evals library, but the UI exposes them as dataset field paths. You can map evaluator inputs from any of these sources:
  • input: the example input payload
  • output: the example output payload
  • reference: the expected output value
  • metadata: example metadata for filtering, grouping, or scoring context
If your dataset fields are nested, use dot notation (for example input.query, output.response, metadata.intent). For additional mapping patterns and transformation examples, see Input Mapping.
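As a rough illustration of the same mapping in code (field names here are hypothetical), a programmatic evaluator receives these sources as function parameters and can reach nested fields itself; the UI's reference corresponds to the expected value:

```python
def matches_intent(output, metadata, expected) -> bool:
    # Equivalent of mapping output.response, metadata.intent, and reference in the UI.
    response = output.get("response", "") if isinstance(output, dict) else str(output)
    intent = metadata.get("intent")  # hypothetical metadata field
    if intent == "exact":
        return response == str(expected)
    return str(expected).lower() in response.lower()
```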

Built-In Code Evaluators

Built-in evaluators are designed for fast, deterministic checks and are configured directly in the UI. Available built-ins and their key settings:
  • contains: whether a text contains one or more words. Key settings: case sensitivity, require all words.
  • exact_match: whether two values match exactly. Key settings: case sensitivity.
  • regex: whether a text matches a regex pattern. Key settings: pattern validation, full match vs. partial match.
  • levenshtein_distance: edit distance between the expected and actual text. Key settings: case sensitivity.
  • json_distance: structural differences between two JSON values. Key settings: parse strings as JSON toggle.
You can map evaluator inputs from dataset fields or supply literal values (for example, a fixed regex pattern).
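To make the checks concrete, here is roughly what each built-in computes, sketched as plain Python. These are illustrative re-implementations under simplified assumptions, not Phoenix's internal code:

```python
import json
import re

def contains(output: str, words: list[str], require_all: bool = False,
             case_sensitive: bool = False) -> bool:
    text = output if case_sensitive else output.lower()
    words = words if case_sensitive else [w.lower() for w in words]
    return all(w in text for w in words) if require_all else any(w in text for w in words)

def exact_match(output: str, expected: str, case_sensitive: bool = True) -> bool:
    return output == expected if case_sensitive else output.lower() == expected.lower()

def regex(output: str, pattern: str, full_match: bool = False) -> bool:
    return bool(re.fullmatch(pattern, output) if full_match else re.search(pattern, output))

def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between the expected and actual text.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def json_distance(output: str, expected: str) -> bool:
    # Simplified: a real JSON distance scores partial structural overlap;
    # here we only parse both strings and compare for structural equality.
    return json.loads(output) == json.loads(expected)
```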

Evaluator Traces

Each dataset evaluator has its own project that captures traces. Use these traces to:
  • Debug unexpected evaluation results
  • Identify where your evaluator prompt needs refinement
  • Track how evaluator behavior changes over time
Access traces from the Traces tab on any evaluator’s detail page.
[Image: Evaluator traces page]
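If you prefer to inspect evaluator traces programmatically, one option is to export the evaluator's project as a dataframe. This is a hedged sketch: the project name below is hypothetical, and the column selection assumes the standard span columns in the exported dataframe:

```python
import phoenix as px

# Hypothetical project name; each dataset evaluator gets its own project,
# shown on the evaluator's detail page.
spans = px.Client().get_spans_dataframe(project_name="correctness-evaluator")

# Inspect the most recent evaluator runs.
print(spans[["name", "start_time", "status_code"]].head())
```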

Workflow

When you run an experiment from the Playground against a dataset with evaluators attached, scores are automatically recorded and evaluator traces are captured for debugging.