When you ran a prompt against a dataset with evaluators attached in the Playground, you ran an experiment. The Playground is one way to do that; the SDK and the CLI are two more. All three produce the same kind of record — and that’s the point. This page is about the experiment as a concept: what’s in one, how the three execution routes converge, and what comparability buys you.

The three parts of an experiment

[Figure: three input boxes (dataset, task, evaluator) converging on a single experiment record, with three execution route labels on the right (UI / code / CLI) all feeding the same record.]
Part | What it is
Dataset | The input rows. Each row exercises the task once.
Task | The thing being tested. For prompt experiments, the task is “run this Prompt Object against the row’s input variables.” But task is more general: it can wrap an entire agent, a multi-step pipeline, anything that takes a row and produces an output.
Evaluator | The scoring function. One or more evaluators run on each row’s output and produce labels and scores.

An experiment run is one execution of the (dataset × task × evaluators) tuple. The record captures every row’s input, the task’s output, every evaluator’s score, plus run-level metadata (duration, who ran it, when, against what prompt version).
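
The shape of that record can be sketched in a few lines of Python. This is a minimal illustration, not the product's SDK: `RowResult`, `ExperimentRecord`, and `run_experiment` are hypothetical names, and the field list is taken only from the description above.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

Row = dict[str, Any]
Task = Callable[[Row], str]               # takes a row, returns an output
Evaluator = Callable[[Row, str], float]   # scores one output for one row


@dataclass
class RowResult:
    row_input: Row
    output: str
    scores: dict[str, float]              # evaluator name -> score


@dataclass
class ExperimentRecord:                   # hypothetical record layout
    dataset_name: str
    prompt_version: str                   # which prompt version was tested
    run_by: str
    started_at: datetime
    duration_s: float
    rows: list[RowResult] = field(default_factory=list)


def run_experiment(dataset_name: str, dataset: list[Row], task: Task,
                   evaluators: dict[str, Evaluator],
                   prompt_version: str, run_by: str) -> ExperimentRecord:
    """Run the task once per row, score each output, and build the record."""
    started_at = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    rows = []
    for row in dataset:
        output = task(row)
        scores = {name: ev(row, output) for name, ev in evaluators.items()}
        rows.append(RowResult(row, output, scores))
    return ExperimentRecord(dataset_name, prompt_version, run_by,
                            started_at, time.perf_counter() - t0, rows)


# Toy usage: the "task" upper-cases the input; the evaluator checks exact match.
dataset = [{"text": "hi", "expected": "HI"}, {"text": "yo", "expected": "YO!"}]
record = run_experiment(
    "toy-set", dataset,
    task=lambda row: row["text"].upper(),
    evaluators={"exact_match": lambda row, out: float(out == row["expected"])},
    prompt_version="v1-placeholder", run_by="alice",
)
for r in record.rows:
    print(r.row_input["text"], "->", r.output, r.scores)
```

The task and the evaluators are plain callables here, which is what lets the same loop serve a one-line prompt call or a much larger pipeline.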

Three execution routes, one record

You can run an experiment three ways. All three produce the same kind of record, with the same fields for the dataset, the task definition, and the evaluator outputs, and they all show up in the same Datasets and Experiments view.
Route | What it looks like | When to use
Playground (UI) | Click Run with a dataset and evaluators attached | Interactive iteration. You’re trying things and want to see results immediately.
Code (Python / TypeScript / REST) | A script that defines the dataset, task, and evaluator, then calls experiments.run(...) (or equivalent) | Reproducible runs from a notebook or script. Useful for batch jobs, controlled experiments, and CI.
CLI / Agent Skills | ax CLI invocations, scriptable from any agent or shell | Same as code but driven by the ax CLI. Useful when you’re working from a terminal-native agent like the Arize CLI agent.

The fact that all three routes produce the same record is what makes the loop trustworthy. A run that was kicked off interactively in the Playground is comparable to a run that came from a CI script, which is comparable to a run from a teammate’s notebook.
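
One way to picture route-independence is a comparability check that never looks at where a run came from. The sketch below is an assumption-laden illustration: `RunSummary`, its fields, and the `ds-1` / `ds-2` identifiers are hypothetical, and the rule "same dataset and same evaluators" is inferred from this page rather than taken from any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:                       # hypothetical, for illustration only
    name: str
    route: str                          # "playground", "code", or "cli"
    dataset_id: str
    evaluator_names: frozenset[str]


def comparable(a: RunSummary, b: RunSummary) -> bool:
    """Runs are comparable when dataset and evaluators match; route is ignored."""
    return (a.dataset_id == b.dataset_id
            and a.evaluator_names == b.evaluator_names)


evals = frozenset({"itinerary_structure"})
ui_run = RunSummary("trip-planner-v1", "playground", "ds-1", evals)
ci_run = RunSummary("trip-planner-v2", "code", "ds-1", evals)
other_run = RunSummary("other-set-v1", "cli", "ds-2", evals)

print(comparable(ui_run, ci_run))     # True: different routes, same dataset
print(comparable(ui_run, other_run))  # False: different dataset
```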

What comparability buys you

Two experiment runs on the same dataset are directly comparable. You can:
  • Diff outputs side-by-side — see exactly which rows the new prompt handled differently.
  • Read evaluator deltas — accuracy went from 0.62 to 0.78; what does that look like row-by-row?
  • Spot regressions — which rows did the old prompt get right that the new prompt now gets wrong?
  • Read summary metrics by run — overall pass rate, mean score, distribution of labels.
Comparison is the operational unit of prompt iteration. “Is the new prompt better?” is too vague to answer. “Run experiment B against the same dataset and evaluators as experiment A, and look at the per-row deltas” is precise and decidable.
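
The per-row comparison can be sketched as follows. This assumes each run has been reduced to a mapping from a row id to a single evaluator score, and that a score of 0.5 or higher counts as a pass; the row ids, scores, threshold, and function names are all made up for illustration.

```python
from statistics import mean

# Hypothetical per-row scores for two runs on the same dataset.
run_a = {"r1": 1.0, "r2": 0.0, "r3": 1.0}   # old prompt
run_b = {"r1": 1.0, "r2": 1.0, "r3": 0.0}   # new prompt


def row_deltas(a: dict[str, float], b: dict[str, float]) -> dict[str, float]:
    """Score change per row (b minus a). Both runs must cover the same rows."""
    if a.keys() != b.keys():
        raise ValueError("runs cover different rows and are not comparable")
    return {row_id: b[row_id] - a[row_id] for row_id in a}


def regressions(a: dict[str, float], b: dict[str, float],
                threshold: float = 0.5) -> list[str]:
    """Rows the old run passed and the new run fails."""
    return [rid for rid in a if a[rid] >= threshold and b[rid] < threshold]


def summary(scores: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Run-level metrics: mean score and pass rate."""
    values = list(scores.values())
    return {
        "mean_score": mean(values),
        "pass_rate": sum(v >= threshold for v in values) / len(values),
    }


print(row_deltas(run_a, run_b))
print(regressions(run_a, run_b))
print(summary(run_a), summary(run_b))
```

In this toy data the two runs have the same mean score, yet the per-row view shows one row fixed and one row regressed, which is why the row-level deltas carry more information than the summary alone.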
[Figure: Experiments tab for the trip-planner test set. A Summary Metrics line chart climbs from 0.86 (v1) to 1.00 (v2); both experiments are selected via checkboxes in the Compare dropdown, and the Experiments table lists trip-planner-v2 and trip-planner-v1 with their itinerary_structure scores.]

The role of the prompt version

Every experiment run records which prompt version it tested — by hash. That hash is permanent. Two implications:
  • Re-runs are reproducible. If you re-run an experiment from three weeks ago against the same dataset and evaluators, the same prompt version comes back from the Hub. Same inputs, same task, same evaluators. The only thing that should differ is non-determinism in the LLM itself.
  • Experiments are auditable. A bug report against a specific experiment run points to a specific prompt version. You can load that version into the Playground, reproduce the bad row, and iterate without guessing what was actually deployed.
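
This page does not say how the version hash is computed. As a sketch of why a hash makes a good permanent identifier, the example below assumes a content-derived hash: SHA-256 over a canonical JSON encoding of the template and its parameters. The function name and the choice of fields are assumptions.

```python
import hashlib
import json


def prompt_version_hash(template: str, params: dict) -> str:
    """Assumed scheme: hash a canonical JSON form of the prompt content."""
    canonical = json.dumps(
        {"template": template, "params": params},
        sort_keys=True,                 # key order must not affect the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = prompt_version_hash("Plan a trip to {city}.",
                         {"temperature": 0.2, "model": "m"})
v1_again = prompt_version_hash("Plan a trip to {city}.",
                               {"model": "m", "temperature": 0.2})
v2 = prompt_version_hash("Plan a 3-day trip to {city}.",
                         {"temperature": 0.2, "model": "m"})

print(v1 == v1_again)  # True: same content, same identifier
print(v1 == v2)        # False: any edit yields a new identifier
```

Under this assumption, an identifier stored on an old experiment run can only ever refer to the exact content that was tested.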

Saving runs back to datasets

When you run a prompt experiment in the Playground, the outputs and eval scores can be saved back as columns on the dataset. Each Playground run is timestamped and tagged with the prompt version. Subsequent runs add more columns alongside. The dataset becomes a growing record of how every prompt version performed on the same rows. Over time, that’s the most useful artifact you have for understanding the evolution of a prompt — you can see at a glance which versions improved which rows.
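
The growing-columns idea can be sketched with plain dictionaries. The column naming scheme, the timestamps, and `save_run_as_columns` are hypothetical; the point is only that each run adds a tagged pair of columns next to the existing ones without touching the input columns.

```python
from copy import deepcopy
from typing import Any

Row = dict[str, Any]


def save_run_as_columns(dataset: list[Row], outputs: list[str],
                        scores: list[float], prompt_version: str,
                        run_at: str) -> list[Row]:
    """Return a copy of the dataset with this run's outputs and scores added."""
    if not (len(dataset) == len(outputs) == len(scores)):
        raise ValueError("need exactly one output and one score per row")
    tag = f"{prompt_version} @ {run_at}"   # illustrative column tag
    updated = deepcopy(dataset)
    for row, out, score in zip(updated, outputs, scores):
        row[f"output [{tag}]"] = out
        row[f"score [{tag}]"] = score
    return updated


rows = [{"city": "Lisbon"}, {"city": "Kyoto"}]
rows = save_run_as_columns(rows, ["plan A", "plan B"], [1.0, 0.0],
                           "v1", "2024-01-01T10:00Z")
rows = save_run_as_columns(rows, ["plan A2", "plan B2"], [1.0, 1.0],
                           "v2", "2024-01-08T10:00Z")

# Each row now carries its input plus one output/score pair per run.
for row in rows:
    print(row)
```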

When the task isn’t just “run the prompt”

For most prompt experiments, the task is a one-liner: render the prompt against the row’s variables, call the LLM, return the output. But the experiment framework treats task as arbitrary — it can be:
  • A multi-step pipeline (retrieve, render, call LLM, post-process).
  • A whole agent (route, plan, call tools, summarize, return).
  • A wrapper around an existing application endpoint.
This generality matters because prompt experiments, agent experiments, and end-to-end pipeline experiments all run on the same machinery. You’re not choosing between three different testing frameworks; you’re choosing what the task wraps.
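
A minimal sketch of that idea: if a task is any callable from a row to an output, a single runner can drive all of the shapes listed above. Everything here is hypothetical, including `fake_llm`, which stands in for a real model call, and `stub_endpoint`, which stands in for an application endpoint.

```python
from typing import Any, Callable

Row = dict[str, Any]
Task = Callable[[Row], str]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it just echoes the prompt."""
    return f"LLM({prompt})"


def prompt_task(row: Row) -> str:
    """The one-liner case: render the prompt, call the model, return output."""
    return fake_llm(f"Plan a trip to {row['city']}.")


def pipeline_task(row: Row) -> str:
    """A multi-step case: retrieve, render, call the model, post-process."""
    context = f"facts about {row['city']}"                       # retrieve
    prompt = f"Using {context}, plan a trip to {row['city']}."   # render
    raw = fake_llm(prompt)                                       # call LLM
    return raw.strip().lower()                                   # post-process


def make_endpoint_task(call_endpoint: Callable[[Row], str]) -> Task:
    """Wrap an existing application entry point as a task."""
    def task(row: Row) -> str:
        return call_endpoint(row)
    return task


def run_task(task: Task, dataset: list[Row]) -> list[str]:
    """The runner only needs the row -> output contract."""
    return [task(row) for row in dataset]


def stub_endpoint(row: Row) -> str:
    return f"endpoint reply for {row['city']}"


dataset = [{"city": "Lisbon"}, {"city": "Kyoto"}]
for task in (prompt_task, pipeline_task, make_endpoint_task(stub_endpoint)):
    print(run_task(task, dataset))
```

The runner never inspects what happens inside the task, so swapping a prompt for an agent changes the task and nothing else in the experiment.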

Next step

Experiments work the same whether you run them by hand or in CI. The next page covers what changes when you put them in CI/CD.

Next: Prompts in CI/CD