Eval-driven development
You have validated evaluators and a golden dataset. Now make them part of your development process. Every time you change a prompt, swap a model, or update your agent logic, score your experiment results against your golden dataset before deploying. Each evaluation runs in a controlled, repeatable environment so you can measure how new versions behave before exposing them to real users. The same evaluators you use for production monitoring work here. Define the criteria once, apply them everywhere. This page covers running evaluators on experiments. To create experiments and datasets, see Datasets and experiments.
Run on an experiment
Once you have an experiment, attach evaluators and score the results. You can do this via Arize Skills, Alyx, the UI, or code; this page walks through the Arize Skills path.
Use the arize-evaluator skill to create an eval task against an experiment via the ax CLI. The skill resolves dataset and experiment IDs, configures column mappings, and triggers the run using ax tasks create and ax tasks trigger-run. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
- “Create an eval task for my v2-prompt-test experiment using my correctness evaluator”
- “Trigger an eval run on my latest experiment and wait for results”
- “Score my v2-prompt-test experiment with my hallucination evaluator”

View eval results
Results appear in the experiment table alongside each row. Compare multiple experiment runs side by side, filter by eval label, and drill into individual rows to inspect the task output and evaluator explanation. Open View Eval Trace from an experiment row when you want the evaluated trace. For per-example outputs and eval explanations together, use Compare Experiments and the Table tab.
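Drilling into rows by eval label boils down to grouping the experiment rows by the label your evaluator assigned. A minimal sketch in plain Python, with hypothetical row data standing in for the experiment table:

```python
from collections import defaultdict

# Hypothetical experiment rows; in Arize these come from the experiment
# table, with the evaluator's label and explanation attached to each row.
rows = [
    {"example_id": 1, "eval_label": "correct",   "explanation": "Answer matches reference."},
    {"example_id": 2, "eval_label": "incorrect", "explanation": "Cites a nonexistent source."},
    {"example_id": 3, "eval_label": "incorrect", "explanation": "Misses the second question."},
]

# Group rows by eval label so failing rows can be inspected together.
by_label = defaultdict(list)
for row in rows:
    by_label[row["eval_label"]].append(row)

# Inspect only the failing rows, as you would when filtering in the UI.
for row in by_label["incorrect"]:
    print(row["example_id"], row["explanation"])
```

This mirrors the filter-by-label workflow in the UI; the structure of the rows here is an assumption for illustration, not the actual export format.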
The CI/CD workflow
1. Make your change (prompt update, model swap, logic change).
2. Run your experiment against the golden dataset.
3. Score the experiment results with your evaluators.
4. Compare eval scores against the previous experiment run.
5. If scores regress, fix before merging.
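The last two steps amount to a regression gate. A minimal sketch, assuming you have already aggregated each run's eval scores into dicts keyed by evaluator name (all names and values here are hypothetical):

```python
# Hypothetical aggregate eval scores for the previous and current runs.
previous = {"correctness": 0.92, "hallucination": 0.88}
current = {"correctness": 0.94, "hallucination": 0.81}

def regressed(prev, curr, tolerance=0.0):
    """Return evaluators whose score dropped by more than the tolerance."""
    return [name for name, score in curr.items()
            if score < prev.get(name, 0.0) - tolerance]

failures = regressed(previous, current)
if failures:
    # In CI you would fail the build here instead of printing.
    print(f"Eval regression in: {failures}")
```

A small tolerance is often worth setting so that run-to-run noise in LLM-judge scores does not block every merge.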
Related workflows
- Define evaluators once in Create evaluators, then attach them to experiments.
- After you ship, use Run online evals on traces to monitor production with the same criteria where possible.
- Follow the Develop tutorials: Defining the dataset, Run experiments with code evals, Run experiments with LLM judge, and Iteration workflow.

