Prerequisites
- A dataset in your space (Build a dataset).
- A registered agent configuration (Setting up your agent endpoint).
- Optional but recommended: tracing wired into your agent (Setting up tracing).
Launch an experiment
Open the dataset
Navigate to Datasets in the left nav and click the dataset you want to run against.
Click New Experiment → Run against agent
The agent playground modal opens with the dataset already selected.
Pick an agent
Use the Agent dropdown to select one of your registered agent configurations. The endpoint URL and auth from the configuration are used automatically — you don’t re-enter them per run.
Choose a preset (or write custom config)
If the agent has request presets, pick one from the Preset dropdown. The preset’s config JSON populates the editor. You can edit the JSON before running, which is useful when you want to tweak one parameter from a known-good baseline without saving a new preset.
Set the body template
The Body Template field is the JSON Arize sends as input in the request. Use {{column_name}} to interpolate values from each dataset row.
(Optional) Limit to a subset
For quick sanity checks, run on a subset of the dataset before committing to the full run. Useful for validating the request shape with one or two rows first.
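To check the request shape before launching anything, you can hydrate a body template locally for the first couple of rows. The sketch below is a local approximation only, not Arize's interpolation logic: the template fields, the column names (`goal`, `budget`), and the rule that a value made up of a single placeholder keeps the column's original type are all assumptions for illustration.

```python
import json
import re

# Matches {{column_name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def hydrate(template, row):
    """Recursively replace {{column}} placeholders in a parsed JSON template."""
    if isinstance(template, dict):
        return {key: hydrate(value, row) for key, value in template.items()}
    if isinstance(template, list):
        return [hydrate(item, row) for item in template]
    if isinstance(template, str):
        whole = PLACEHOLDER.fullmatch(template)
        if whole:
            # Assumption: a value that is only a placeholder keeps the column's type.
            return row[whole.group(1)]
        return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)
    return template


# Hypothetical body template and dataset rows.
body_template = {"goal": "{{goal}}", "note": "Traveler budget: {{budget}} USD"}
rows = [
    {"goal": "Weekend in NYC from Chicago", "budget": 800},
    {"goal": "Plan a 3-day trip to Tokyo from SF in October", "budget": 2500},
    {"goal": "Long weekend in Lisbon", "budget": 1200},
]

# Hydrate only a small subset, mirroring a one-or-two-row sanity check.
for row in rows[:2]:
    print(json.dumps(hydrate(body_template, row)))
```

A placeholder that names a column missing from the row raises `KeyError`, which surfaces template typos before any request is sent.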
(Optional) Add evaluators
Attach evaluators to score each run’s output automatically. See Evals overview for setup.
Watch a run in progress
As the experiment runs, the experiments view streams rows in real time:
- Status column — pending, running, succeeded, failed.
- Output column — the agent’s response body.
- Latency / tokens — populated from the spans your agent emitted (if traced).
- Evaluator scores — computed as each row completes.
Inspecting an individual run
Click any row to see:
- Input — exact JSON body sent to your endpoint, including the hydrated input and the arize_metadata Arize appended.
- Output — full response body returned.
- Trace — if your agent is traced, the linked trace tree (CHAIN, LLM, TOOL spans).
- Headers — the request headers Arize sent, including traceparent, Authorization, and any custom headers you configured.
- Evaluator scores — per-evaluator pass/fail and reasoning.
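To match a run against your agent's own logs, it can help to pull the trace ID out of the traceparent header shown in the Headers view. The sketch below assumes the header follows the W3C Trace Context version 00 layout (version, trace ID, parent ID, flags); the sample header value is the illustrative one from that specification, not output from Arize.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields and decode the sampled flag."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a valid traceparent header: {header!r}")
    parts = match.groupdict()
    # All-zero IDs are explicitly invalid in the specification.
    if parts["trace_id"] == "0" * 32 or parts["parent_id"] == "0" * 16:
        raise ValueError("trace_id and parent_id must not be all zeros")
    # The lowest bit of the flags byte is the "sampled" flag.
    parts["sampled"] = bool(int(parts["flags"], 16) & 0x01)
    return parts


# Sample value for illustration only.
parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["trace_id"], parsed["sampled"])
```

Logging `trace_id` on the agent side gives a key you can search for when lining up a row in the experiment with your own request logs.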
Comparing runs
After you have two or more experiments on the same dataset, comparison is the point.
From the experiments tab
Open Datasets → your dataset → Experiments, multi-select the runs you want to compare, and click Compare. The comparison view shows:
- Side-by-side outputs for each dataset row across the selected runs.
- Evaluator deltas — which rows improved, regressed, or stayed flat.
- Summary metrics — pass rate, average latency, token counts per run.
- Tool-call patterns — if traced, you can see which runs called different tools or took different paths.
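The idea behind evaluator deltas can be sketched in a few lines: for each dataset row present in both runs, compare the evaluator score and bucket the row as improved, regressed, or flat. This is an illustration of the concept rather than how the comparison view is implemented; the row IDs and the 1/0 pass/fail scores are made-up data.

```python
def evaluator_deltas(baseline: dict, candidate: dict) -> dict:
    """Bucket each row shared by both runs as improved, regressed, or flat."""
    deltas = {"improved": [], "regressed": [], "flat": []}
    for row_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[row_id], candidate[row_id]
        if after > before:
            deltas["improved"].append(row_id)
        elif after < before:
            deltas["regressed"].append(row_id)
        else:
            deltas["flat"].append(row_id)
    return deltas


def pass_rate(scores: dict) -> float:
    """Fraction of rows that passed, with pass = 1 and fail = 0."""
    return sum(scores.values()) / len(scores)


# Hypothetical per-row scores for two experiments (row id -> 1 pass / 0 fail).
baseline_scores = {1: 1, 2: 0, 3: 1, 4: 1}
candidate_scores = {1: 1, 2: 1, 3: 0, 4: 1}

deltas = evaluator_deltas(baseline_scores, candidate_scores)
print(deltas)
print(pass_rate(baseline_scores), pass_rate(candidate_scores))
```

In this made-up data both runs have the same pass rate even though one row improved and another regressed, which is why the row-level deltas are worth reading alongside the summary metrics.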
Common comparison patterns
| Question | How to set it up |
|---|---|
| Did the new model help? | Run two experiments, same dataset and preset, vary only config.model. |
| Did the prompt change break anything? | Run baseline before the deploy, then run again after. Compare evaluator deltas. |
| Which preset is best for prod? | Run each preset against the same dataset. Eyeball the summary table. |
| Is this regression specific to one input type? | Filter the comparison view by a metadata column (e.g. category). |
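The last pattern in the table, checking whether a regression is specific to one input type, amounts to grouping pass rates by a metadata column. A minimal sketch, assuming each exported row carries a hypothetical `category` column and a boolean `passed` field (both names and all values are made up for illustration):

```python
from collections import defaultdict


def pass_rate_by_column(rows, column):
    """Pass rate per distinct value of a metadata column."""
    counts = defaultdict(lambda: [0, 0])  # column value -> [passed, total]
    for row in rows:
        counts[row[column]][0] += int(row["passed"])
        counts[row[column]][1] += 1
    return {value: passed / total for value, (passed, total) in counts.items()}


# Hypothetical rows from a baseline run and a candidate run.
baseline_rows = [
    {"category": "weekend", "passed": True},
    {"category": "weekend", "passed": True},
    {"category": "multi-city", "passed": True},
    {"category": "multi-city", "passed": True},
]
candidate_rows = [
    {"category": "weekend", "passed": True},
    {"category": "weekend", "passed": True},
    {"category": "multi-city", "passed": False},
    {"category": "multi-city", "passed": True},
]

baseline_rates = pass_rate_by_column(baseline_rows, "category")
candidate_rates = pass_rate_by_column(candidate_rows, "category")
# Negative values mark categories where the candidate run did worse.
change = {c: candidate_rates[c] - baseline_rates[c] for c in baseline_rates}
print(change)
```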
Re-running failed rows
In the experiment detail view, filter for status = failed, select the failed rows, and click Re-run. Arize calls your endpoint only for those rows and merges the new results into the existing experiment, so you keep the successful rows when chasing one flake.
Concurrency, timeouts, and retries
When you launch an experiment, you can tune:
- Concurrency — how many parallel POST requests Arize sends. Default is 10. Lower it if your agent’s downstream API has a rate limit.
- Timeout — per-request, in seconds. Default 120s. Raise it for agents with long agent loops (e.g. multi-step research agents).
- Retries — for transient failures (5xx, network errors), Arize retries with exponential backoff. Defaults are conservative; raise the retry count for agents you know are flaky.
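To build intuition for how these three settings interact, here is a self-contained sketch of the general pattern: a semaphore caps the number of in-flight requests, each request gets its own timeout, and transient failures are retried with exponential backoff. It is not Arize's implementation. The function names, the `TransientError` class, the retry count, and the backoff constants are assumptions; only the defaults of 10 concurrent requests and 120 seconds come from the text above. In this sketch a timeout is not retried, which is also an assumption.

```python
import asyncio
import random


class TransientError(Exception):
    """Stands in for a 5xx response or a network error."""


async def call_with_retries(send, row, *, timeout_s=120.0, max_retries=3, base_delay_s=0.5):
    """Call send(row) with a per-request timeout, retrying transient failures."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(send(row), timeout=timeout_s)
        except TransientError:
            if attempt == max_retries:
                raise
            # Exponential backoff: base, 2x base, 4x base, ... plus up to 10% jitter.
            delay = base_delay_s * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))


async def run_rows(send, rows, *, concurrency=10, **retry_options):
    """Process every row with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(row):
        async with semaphore:
            try:
                output = await call_with_retries(send, row, **retry_options)
                return {"row": row, "status": "succeeded", "output": output}
            except Exception as exc:  # timeouts and exhausted retries end up here
                return {"row": row, "status": "failed", "error": repr(exc)}

    return await asyncio.gather(*(worker(row) for row in rows))


async def demo():
    attempts = {}

    async def flaky_send(row):
        # Fake endpoint: every row fails once with a simulated 503, then succeeds.
        attempts[row] = attempts.get(row, 0) + 1
        if attempts[row] == 1:
            raise TransientError("simulated 503")
        return f"ok:{row}"

    return await run_rows(flaky_send, ["a", "b", "c"], concurrency=2, base_delay_s=0.01)


results = asyncio.run(demo())
print([r["status"] for r in results])
```

Note that a row keeps its concurrency slot while it backs off, so lowering concurrency also slows the rate of retries against a rate-limited downstream API.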
Running from code or CLI
The agent playground is the UI path. If you’d rather drive it from code (e.g. as part of a CI pipeline), use:
- Python SDK — AgentEndpointTask in arize.experiments. See Experiment in code.
- ax CLI — ax agent-replay run --dataset-name <ds> --agent-config-name <agent> for headless runs.
- REST API — for orchestration from any language.
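For a CI step, one option is to shell out to the CLI command listed above from a small Python wrapper. The sketch below uses only the two flags named in this section; the dataset and agent names are taken from the walkthrough on this page, the helper names are hypothetical, and treating a non-zero exit code as a failed run is an assumption to confirm against the CLI's own documentation.

```python
import shlex
import subprocess


def build_replay_command(dataset_name: str, agent_config_name: str) -> list[str]:
    """Build the argv for a headless run, using only the flags named in this guide."""
    return [
        "ax", "agent-replay", "run",
        "--dataset-name", dataset_name,
        "--agent-config-name", agent_config_name,
    ]


def run_replay(dataset_name: str, agent_config_name: str) -> int:
    """Run the CLI and return its exit code (requires the ax CLI on PATH)."""
    completed = subprocess.run(build_replay_command(dataset_name, agent_config_name), check=False)
    # Assumption: a non-zero exit code signals a failed run.
    return completed.returncode


command = build_replay_command("travel-goals-v1", "travel-agent")
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when dataset or agent names contain spaces.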
End-to-end example
Walking through the travel-agent demo (registered agent, dataset of 20 travel goals, three presets):
Pick the dataset
Open travel-goals-v1 (20 goals like “Plan a 3-day trip to Tokyo from SF in October”, “Weekend in NYC from Chicago”).
Launch with the baseline preset
New Experiment → Run against agent → travel-agent → Production baseline (Sonnet 4.5) → Run. Wait ~3 minutes for 20 rows to complete.
Launch a second experiment with Opus
Same dataset, same agent, Opus 4.7 preset. Run. ~5 minutes (Opus is slower).
Compare
Compare Experiments → see per-row output diffs. Opus produced richer itineraries on 16/20 rows, but average latency was 2.3× higher. Pass rate on the “produces a coherent multi-day plan” evaluator: Sonnet 18/20, Opus 20/20.
Inspect a regression
On row 7 (Lisbon), Sonnet picked a $$$$ hotel; Opus picked a $$ one. Open the Sonnet trace and look at the search_hotels TOOL span: it ranked by rating, not by the max_price constraint. The fix is a system prompt tweak.
Next
Compare experiments
Side-by-side diffs and evaluator deltas across runs.
Run evals on experiments
Add evaluators to score agent outputs.
CI/CD with experiments
Trigger agent experiments from your deploy pipeline.
Experiment in code
Drive agent experiments from Python / TypeScript / CLI.