The three parts of an experiment

| Part | What it is |
|---|---|
| Dataset | The input rows. Each row exercises the task once. |
| Task | The thing being tested. For prompt experiments, the task is “run this Prompt Object against the row’s input variables.” But task is more general — it can wrap an entire agent, a multi-step pipeline, anything that takes a row and produces an output. |
| Evaluator | The scoring function. One or more evaluators run on each row’s output and produce labels and scores. |
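
As a rough illustration of how the three parts fit together, here is a minimal sketch in plain Python. It does not use the Arize SDK; `echo_task`, `exact_match`, and `run_experiment` are hypothetical names invented for this example, and the task is a stand-in that never calls an LLM.

```python
from typing import Callable

Row = dict                               # one dataset row: inputs plus an expected answer
Task = Callable[[Row], str]              # takes a row, produces an output
Evaluator = Callable[[Row, str], dict]   # returns {"label": ..., "score": ...}


def echo_task(row: Row) -> str:
    """Stand-in for 'run the prompt against the row's variables' (no LLM call)."""
    return row["question"].strip().lower()


def exact_match(row: Row, output: str) -> dict:
    """Score 1.0 when the output equals the row's expected answer."""
    correct = output == row["expected"]
    return {"label": "correct" if correct else "incorrect", "score": float(correct)}


def run_experiment(dataset: list[Row], task: Task,
                   evaluators: dict[str, Evaluator]) -> list[dict]:
    """Run the task once per row, then every evaluator on that row's output."""
    results = []
    for row in dataset:
        output = task(row)
        evals = {name: fn(row, output) for name, fn in evaluators.items()}
        results.append({"input": row, "output": output, "evals": evals})
    return results


dataset = [
    {"question": " Paris ", "expected": "paris"},
    {"question": "Berlin", "expected": "munich"},
]
results = run_experiment(dataset, echo_task, {"exact_match": exact_match})
```

Each entry in `results` pairs one row with its output and its evaluator labels and scores, which is the shape the rest of this page assumes.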

Three execution routes, one record

You can run an experiment three ways. All three produce the same kind of record (same dataset, same task definition, same evaluator outputs), and they all show up in the same Datasets and Experiments view.

| Route | What it looks like | When to use |
|---|---|---|
| Playground (UI) | Click Run with a dataset and evaluators attached | Interactive iteration. You’re trying things and want to see results immediately. |
| Code (Python / TypeScript / REST) | A script that defines the dataset, task, and evaluator, then calls experiments.run(...) (or equivalent) | Reproducible runs from a notebook or script. Useful for batch jobs, controlled experiments, and CI. |
| CLI / Agent Skills | ax CLI invocations, scriptable from any agent or shell | Same as code but driven by the ax CLI. Useful when you’re working from a terminal-native agent like the Arize CLI agent. |

What comparability buys you

Two experiment runs on the same dataset are directly comparable. You can:

- Diff outputs side-by-side: see exactly which rows the new prompt handled differently.
- Read evaluator deltas: accuracy went from 0.62 to 0.78; what does that look like row-by-row?
- Spot regressions: which rows did the old prompt get right that the new prompt now gets wrong?
- Read summary metrics by run: overall pass rate, mean score, distribution of labels.
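
The comparisons above can be sketched with plain dictionaries. The per-row scores below are made-up illustration data, and `compare_runs` is a hypothetical helper, not part of any Arize API; the only assumption is that both runs are keyed by the same row IDs.

```python
from statistics import mean

# Hypothetical per-row scores from two runs on the same dataset, keyed by row ID.
old_run = {"r1": 1.0, "r2": 0.0, "r3": 1.0, "r4": 0.0}
new_run = {"r1": 1.0, "r2": 1.0, "r3": 0.0, "r4": 1.0}


def compare_runs(old: dict[str, float], new: dict[str, float]) -> dict:
    """Summarize two runs and list the rows that got worse or better."""
    if old.keys() != new.keys():
        raise ValueError("runs must cover the same dataset rows to be comparable")
    rows = sorted(old)
    return {
        "old_mean": mean(old[r] for r in rows),
        "new_mean": mean(new[r] for r in rows),
        "regressions": [r for r in rows if new[r] < old[r]],
        "improvements": [r for r in rows if new[r] > old[r]],
    }


summary = compare_runs(old_run, new_run)
```

Note that the mean score improves here even though row `r3` regresses, which is why the row-level view matters alongside the summary metric.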

The role of the prompt version

Every experiment run records which prompt version it tested, by hash. That hash is permanent. Two implications:

- Re-runs are reproducible. If you re-run an experiment from three weeks ago against the same dataset and evaluators, the same prompt version comes back from the Hub. Same inputs, same task, same evaluators. The only thing that should differ is non-determinism in the LLM itself.
- Experiments are auditable. A bug report against a specific experiment run points to a specific prompt version. You can load that version into the Playground, reproduce the bad row, and iterate without guessing what was actually deployed.
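
To make the idea of a permanent, content-derived version ID concrete, here is a small sketch. How the Hub actually computes its hashes is not described here, so the scheme below (SHA-256 over a canonical JSON form, truncated to 12 characters) is purely an assumption for illustration, and `prompt_version_hash` is a hypothetical helper.

```python
import hashlib
import json


def prompt_version_hash(template: str, params: dict) -> str:
    """Derive a stable ID from the prompt's content (assumed scheme, for illustration)."""
    canonical = json.dumps({"template": template, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = prompt_version_hash("Answer briefly: {question}", {"temperature": 0.0})
v1_again = prompt_version_hash("Answer briefly: {question}", {"temperature": 0.0})
v2 = prompt_version_hash("Answer in one sentence: {question}", {"temperature": 0.0})

# An experiment record pins the exact version it tested (dataset name is made up).
run_record = {"dataset": "support-questions", "prompt_version": v1}
```

Identical content always yields the same ID and any edit yields a different one, so a record that stores the ID identifies exactly which prompt was tested.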

Saving runs back to datasets

When you run a prompt experiment in the Playground, the outputs and eval scores can be saved back as columns on the dataset. Each Playground run is timestamped and tagged with the prompt version. Subsequent runs add more columns alongside. The dataset becomes a growing record of how every prompt version performed on the same rows. Over time, that’s the most useful artifact you have for understanding the evolution of a prompt: you can see at a glance which versions improved which rows.

When the task isn’t just “run the prompt”

For most prompt experiments, the task is a one-liner: render the prompt against the row’s variables, call the LLM, return the output. But the experiment framework treats the task as arbitrary. It can be:
- A multi-step pipeline (retrieve, render, call LLM, post-process).
- A whole agent (route, plan, call tools, summarize, return).
- A wrapper around an existing application endpoint.
task wraps.
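
A minimal sketch of that flexibility, in plain Python: the harness only requires "row in, output out", so a one-line prompt task and a multi-step pipeline are interchangeable. Every name here (`fake_llm`, `prompt_task`, `pipeline_task`, `run`) is hypothetical, and `fake_llm` is an offline placeholder rather than a real model call.

```python
from typing import Callable

Row = dict
Task = Callable[[Row], str]


def fake_llm(prompt: str) -> str:
    """Placeholder for a model call so the sketch runs offline."""
    return f"LLM({prompt})"


def prompt_task(row: Row) -> str:
    """The one-liner case: render the prompt, call the model, return the output."""
    return fake_llm(f"Answer: {row['question']}")


def pipeline_task(row: Row) -> str:
    """A multi-step case: retrieve, render, call the model, post-process."""
    context = {"refunds": "Refunds take 5 days."}.get(row["topic"], "")
    raw = fake_llm(f"Context: {context}\nAnswer: {row['question']}")
    return raw.strip()


def run(dataset: list[Row], task: Task) -> list[str]:
    """The harness never inspects the task; it just calls it once per row."""
    return [task(row) for row in dataset]


dataset = [{"question": "How long do refunds take?", "topic": "refunds"}]
simple_outputs = run(dataset, prompt_task)
pipeline_outputs = run(dataset, pipeline_task)
```

Because both tasks share one signature, the same dataset and the same evaluators can score either of them without any change to the harness.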