
Eval-driven development

You have validated evaluators and a golden dataset. Now make them part of your development process. Every time you change a prompt, swap a model, or update your agent logic, score your experiment results against your golden dataset before deploying. Each evaluation runs in a controlled, repeatable environment, so you can measure how new versions behave before exposing them to real users.

The same evaluators you use for production monitoring work here: define the criteria once, apply them everywhere. This page covers running evaluators on experiments. To create experiments and datasets, see Datasets and experiments.
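To make "define the criteria once, apply them everywhere" concrete, here is a minimal sketch of one evaluator function scoring experiment outputs against a golden dataset. The exact-match evaluator, dataset rows, and outputs are all hypothetical illustrations, not Arize APIs:

```python
# Hypothetical sketch: the same evaluator function can score experiment
# rows offline (against a golden dataset) and production traces online.

def correctness_evaluator(output: str, expected: str) -> dict:
    """Score a single output against its golden expected answer."""
    is_correct = output.strip().lower() == expected.strip().lower()
    return {
        "label": "correct" if is_correct else "incorrect",
        "score": 1.0 if is_correct else 0.0,
    }

golden_dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]

# Outputs produced by the new prompt/model version under test.
experiment_outputs = ["Paris", "5"]

results = [
    correctness_evaluator(out, row["expected"])
    for out, row in zip(experiment_outputs, golden_dataset)
]
pass_rate = sum(r["score"] for r in results) / len(results)
print(pass_rate)  # 0.5
```

In practice your evaluator would likely be an LLM judge rather than an exact match, but the pattern is the same: one scoring function, applied to every version of your application.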
Experiments tab with summary metrics chart and experiments table, row menu open with View Eval Trace and View Logs

Run on an experiment

Once you have an experiment, attach evaluators and score the results.
Use the arize-evaluator skill to create an eval task against an experiment via the ax CLI. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
  • “Create an eval task for my v2-prompt-test experiment using my correctness evaluator”
  • “Trigger an eval run on my latest experiment and wait for results”
  • “Score my v2-prompt-test experiment with my hallucination evaluator”
Terminal showing ax datasets create and help output, ax experiments help, and an in-progress v2-prompt-test experiment with steps to create dataset, run experiment, create evaluator, create eval task, and trigger run
The skill resolves dataset and experiment IDs, configures column mappings, and triggers the run using ax tasks create and ax tasks trigger-run.

View eval results

Results appear in the experiment table alongside each row. Compare multiple experiment runs side by side, filter by eval label, and drill into individual rows to inspect the task output and the evaluator's explanation. Open View Eval Trace from an experiment row to inspect the full trace that was evaluated. To see per-example outputs and eval explanations together, use Compare Experiments and its Table tab.
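The side-by-side comparison boils down to diffing eval labels per row between two runs. A minimal sketch, with hypothetical row IDs and labels:

```python
# Hypothetical sketch: compare eval labels between two experiment runs
# row by row, as the Compare Experiments table does, to find rows that
# regressed from "correct" to "incorrect" in the new version.

run_a = {"row-1": "correct", "row-2": "correct", "row-3": "incorrect"}
run_b = {"row-1": "correct", "row-2": "incorrect", "row-3": "incorrect"}

regressions = [
    row for row in run_a
    if run_a[row] == "correct" and run_b.get(row) == "incorrect"
]
print(regressions)  # ['row-2']
```

Rows that flip from correct to incorrect are exactly the ones worth drilling into: open the evaluator explanation and the eval trace for each to see what the change broke.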
Compare Experiments table tab showing dataset rows, experiment outputs, Evals column with correct and incorrect tags, and an open popover with score, label, and explanation for an evaluator

The CI/CD workflow

  1. Make your change (prompt update, model swap, logic change).
  2. Run your experiment against the golden dataset.
  3. Score the experiment results with your evaluators.
  4. Compare eval scores against the previous experiment run.
  5. If scores regress, fix before merging.
You can integrate this into GitHub Actions or GitLab CI/CD so it runs automatically on every push or pull request. See GitHub Action basics and GitLab CI/CD basics for full guides.
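Steps 4 and 5 of the workflow above amount to a regression gate: the CI job compares the new run's eval score to the previous run's and fails the build on a meaningful drop. A minimal sketch, where the threshold, scores, and function name are illustrative, not part of any Arize API:

```python
# Hypothetical CI regression gate: block the merge if the new
# experiment's eval score drops more than a tolerance below the
# previous run's score. In a real pipeline you would call
# sys.exit(regression_gate(...)) so the job fails on regression.

def regression_gate(previous_score: float, current_score: float,
                    tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 to pass CI, 1 to block the merge."""
    if current_score < previous_score - tolerance:
        print(f"Eval regression: {previous_score:.2f} -> {current_score:.2f}")
        return 1
    print(f"Evals OK: {previous_score:.2f} -> {current_score:.2f}")
    return 0

# In CI these scores would come from your experiment results,
# e.g. fetched after triggering the run via the ax CLI.
print(regression_gate(previous_score=0.92, current_score=0.85))  # 1
```

The tolerance keeps noisy LLM-judge scores from failing builds on insignificant fluctuations; tune it to the variance you observe across repeated runs on the same version.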