Running agent experiments

Once you’ve registered an agent and wired up tracing, running an experiment against it is a no-code workflow. This page covers the agent playground UI for kicking off, monitoring, and comparing runs.

Prerequisites

A dataset in your space (Build a dataset).
A registered agent configuration (Setting up your agent endpoint).
Optional but recommended: tracing wired into your agent (Setting up tracing).

Launch an experiment

Open the dataset

Navigate to Datasets in the left nav and click the dataset you want to run against.

Click New Experiment → Run in Agent Playground

The agent playground modal opens with the dataset already selected.

Pick an agent

Use the Agent dropdown to select one of your registered agent configurations. The endpoint URL and auth from the configuration are used automatically — you don’t re-enter them per run.

Choose a preset (or write custom config)

If the agent has request presets, pick one from the Preset dropdown. The preset’s config JSON populates the editor. You can edit the JSON before running, which is useful when you want to tweak one parameter from a known-good baseline without saving a new preset.

Set the body template

The Body Template field is the JSON Arize sends as input in the request. Use {{dataset.column_name}} to interpolate from each dataset row:

{
  "goal": "{{dataset.input}}",
  "config": { "model": "claude-sonnet-4-6", "max_turns": 12 }
}
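To make the interpolation concrete, here is a minimal sketch of how a {{dataset.column_name}} placeholder could be filled from one dataset row. It is an illustration only, not Arize's implementation: the `hydrate` function, the placeholder regex, and the escaping behavior are all assumptions.

```python
import json
import re

# Matches placeholders such as {{dataset.input}} (assumed syntax, per the example above).
PLACEHOLDER = re.compile(r"\{\{\s*dataset\.(\w+)\s*\}\}")


def hydrate(template: str, row: dict) -> dict:
    """Fill a JSON body template from one dataset row and parse the result."""

    def substitute(match: re.Match) -> str:
        value = row[match.group(1)]  # raises KeyError if the column is missing
        # json.dumps escapes quotes and newlines; drop the surrounding quotes
        # because the placeholder already sits inside a JSON string.
        return json.dumps(str(value))[1:-1]

    return json.loads(PLACEHOLDER.sub(substitute, template))


template = """
{
  "goal": "{{dataset.input}}",
  "config": { "model": "claude-sonnet-4-6", "max_turns": 12 }
}
"""
body = hydrate(template, {"input": 'Plan a "3-day" trip to Tokyo'})
print(body["goal"])
```

Escaping each value before substitution keeps the body valid JSON even when a dataset cell contains quotes or line breaks, which is worth checking on one or two rows before a full run.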

(Optional) Limit to a subset

For quick sanity checks, run on a subset of the dataset before committing to the full run. Useful for validating the request shape with one or two rows first.

(Optional) Add evaluators

Attach evaluators to score each run’s output automatically. See Evals overview for setup.

Click Run

Arize fans out the dataset against your endpoint in parallel, applies retries on transient failures, and streams results into a new experiment.

Watch a run in progress

As the experiment runs, the experiments view streams rows in real time:

Status column — pending, running, succeeded, failed.
Output column — the agent’s response body.
Latency / tokens — populated from the spans your agent emitted (if traced).
Evaluator scores — computed as each row completes.

If a row fails (timeout, HTTP error, agent exception), the error message appears in the run’s detail panel along with the request body that was sent — useful for debugging and re-running just the failed subset.

Inspecting an individual run

Click any row to see:

Input — exact JSON body sent to your endpoint, including the hydrated input and the arize_metadata Arize appended.
Output — full response body returned.
Trace — if your agent is traced, the linked trace tree (CHAIN, LLM, TOOL spans).
Headers — the request headers Arize sent, including traceparent, Authorization, and any custom headers you configured.
Evaluator scores — per-evaluator pass/fail and reasoning.

The trace link is the highest-leverage debugging tool: when an agent run produces an unexpected output, opening the trace shows you which tool was called, what the LLM was prompted with at each turn, and where the chain diverged.
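If you want to log or validate the incoming traceparent header on the agent side, the sketch below parses it according to the W3C Trace Context layout (version-traceid-parentid-flags). `parse_traceparent` is a hypothetical helper, not part of any Arize SDK, and an OpenTelemetry propagator would normally do this work for you.

```python
import re

# W3C Trace Context: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed["trace_id"])
```

Logging the trace id next to your own application logs makes it easier to match a failed experiment row to the request your service actually received.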

Comparing runs

After you have two or more experiments on the same dataset, you can compare them row by row.

From the experiments tab

Open Datasets → your dataset → Experiments, multi-select the runs you want to compare, and click Compare. The comparison view shows:

Side-by-side outputs for each dataset row across the selected runs.
Evaluator deltas — which rows improved, regressed, or stayed flat.
Summary metrics — pass rate, average latency, token counts per run.
Tool-call patterns — if traced, you can see which runs called different tools or took different paths.
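The evaluator deltas and pass rates in the comparison view boil down to simple per-row arithmetic. A rough sketch, assuming each run is reduced to a mapping of row id to pass/fail for one evaluator (the `compare_runs` function and this data shape are hypothetical, chosen for illustration):

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare two runs, each given as {row_id: True for pass, False for fail}."""
    # Only rows present in both runs can be compared.
    shared = sorted(baseline.keys() & candidate.keys())

    def pass_rate(run: dict) -> float:
        return sum(run[r] for r in shared) / len(shared) if shared else 0.0

    return {
        "improved": [r for r in shared if not baseline[r] and candidate[r]],
        "regressed": [r for r in shared if baseline[r] and not candidate[r]],
        "flat": [r for r in shared if baseline[r] == candidate[r]],
        "baseline_pass_rate": pass_rate(baseline),
        "candidate_pass_rate": pass_rate(candidate),
    }


report = compare_runs(
    baseline={1: True, 2: False, 3: True},
    candidate={1: True, 2: True, 3: False},
)
print(report["improved"], report["regressed"])
```

Note that two runs can share the same pass rate while differing on individual rows, which is why the per-row lists matter more than the summary number.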

Common comparison patterns

| Question | How to set it up |
| --- | --- |
| Did the new model help? | Run two experiments, same dataset and preset, vary only `config.model`. |
| Did the prompt change break anything? | Run baseline before the deploy, then run again after. Compare evaluator deltas. |
| Which preset is best for prod? | Run each preset against the same dataset. Eyeball the summary table. |
| Is this regression specific to one input type? | Filter the comparison view by a metadata column (e.g. `category`). |

Re-running failed rows

In the experiment detail view, filter for status = failed, select them, and click Re-run. Arize calls your endpoint just for those rows and merges the new results into the existing experiment — so you don’t have to lose the successful rows when chasing one flake.
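Conceptually, the merge replaces only the rows that were re-run and leaves everything else untouched. A small sketch of that idea, assuming results are keyed by row id (the `failed_row_ids` and `merge_rerun` helpers and the result shape are hypothetical, not Arize's data model):

```python
def failed_row_ids(results: dict) -> list:
    """Return the ids of rows whose status is 'failed'."""
    return [row_id for row_id, r in results.items() if r["status"] == "failed"]


def merge_rerun(existing: dict, rerun: dict) -> dict:
    """Overlay re-run results on the original ones without mutating either."""
    merged = dict(existing)
    merged.update(rerun)  # only re-run rows are overwritten
    return merged


first_pass = {
    "row-1": {"status": "succeeded", "output": "itinerary A"},
    "row-2": {"status": "failed", "error": "timeout"},
}
to_retry = failed_row_ids(first_pass)
second_pass = {"row-2": {"status": "succeeded", "output": "itinerary B"}}
final = merge_rerun(first_pass, second_pass)
print(to_retry, final["row-2"]["status"])
```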

Concurrency, timeouts, and retries

When you launch an experiment, you can tune:

Concurrency — how many parallel POST requests Arize sends. Default is 10. Lower it if your agent’s downstream API has a rate limit.
Timeout — per-request, in seconds. Default 120s. Raise it for agents with long agent loops (e.g. multi-step research agents).
Retries — for transient failures (5xx, network errors), Arize retries with exponential backoff. Defaults are conservative; raise the retry count for agents you know are flaky.

These settings live in the launch modal under Advanced options.
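To build intuition for how these three settings interact, the following sketch fans rows out over a bounded thread pool and retries transient failures with exponential backoff. It is a simplified model, not Arize's runner: `TransientError`, `call_with_retries`, `run_rows`, and the default values for `max_retries` and `base_delay` are assumptions, and the per-request timeout is left to whatever HTTP client the `send` callable uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Stand-in for a 5xx response or a network error."""


def call_with_retries(send, row, max_retries=3, base_delay=0.5):
    """Call send(row), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "succeeded", "output": send(row)}
        except TransientError as exc:
            if attempt == max_retries:
                return {"status": "failed", "error": str(exc)}
            # Wait base_delay, then 2x, then 4x, and so on.
            time.sleep(base_delay * 2 ** attempt)


def run_rows(send, rows, concurrency=10, **retry_kwargs):
    """Process rows with at most `concurrency` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(
            pool.map(lambda r: call_with_retries(send, r, **retry_kwargs), rows)
        )


# Demo: an endpoint that fails once per row, then succeeds.
attempts = {}

def flaky(row):
    attempts[row] = attempts.get(row, 0) + 1
    if attempts[row] < 2:
        raise TransientError("503")
    return row.upper()

results = run_rows(flaky, ["a", "b"], concurrency=2, max_retries=2, base_delay=0)
print([r["status"] for r in results])
```

Lowering `concurrency` reduces pressure on a rate-limited downstream API, while raising the retry count trades longer runs for fewer failed rows.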

Running from code or CLI

The agent playground is the UI path. If you’d rather drive it from code (e.g. as part of a CI pipeline), use:

Python SDK — AgentEndpointTask in arize.experiments. See Experiment in code.
ax CLI — ax agent-replay run --dataset-name <ds> --agent-config-name <agent> for headless runs.
REST API — for orchestration from any language.

All three paths produce the same experiment artifacts as the UI path, so you can mix and match (kick off via CI, debug in the UI).
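For a CI step, one option is to wrap the CLI command shown above in a small Python script. The sketch below only assembles the argument list from that command. `build_replay_command` and `run_replay` are hypothetical helpers, and it is an assumption that the ax CLI is installed on the runner and exits non-zero when a run fails.

```python
import shlex
import subprocess


def build_replay_command(dataset_name: str, agent_config_name: str) -> list:
    """Assemble the headless-run command using the flags listed above."""
    return [
        "ax", "agent-replay", "run",
        "--dataset-name", dataset_name,
        "--agent-config-name", agent_config_name,
    ]


def run_replay(dataset_name: str, agent_config_name: str) -> None:
    """Run the command; check=True raises if the CLI exits non-zero (assumed)."""
    subprocess.run(build_replay_command(dataset_name, agent_config_name), check=True)


command = build_replay_command("travel-goals-v1", "travel-agent")
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when dataset or agent names contain spaces.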

End-to-end example

Walking through the travel-agent demo (registered agent, dataset of 20 travel goals, three presets):

Pick the dataset

Open travel-goals-v1 (20 goals like “Plan a 3-day trip to Tokyo from SF in October”, “Weekend in NYC from Chicago”).

Launch with the baseline preset

New Experiment → Run in Agent Playground → travel-agent → Production baseline (Sonnet 4.5) → Run. Wait ~3 minutes for 20 rows to complete.

Launch a second experiment with Opus

Same dataset, same agent, Opus 4.7 preset. Run. ~5 minutes (Opus is slower).

Compare

Compare Experiments → see per-row output diffs. Opus produced richer itineraries on 16/20 rows, but average latency was 2.3× higher. Pass rate on the “produces a coherent multi-day plan” evaluator: Sonnet 18/20, Opus 20/20.

Inspect a regression

On row 7 (Lisbon), Sonnet picked a $$$$ hotel; Opus picked a $$ one. Open the Sonnet trace and look at the search_hotels TOOL span: it ranked by rating and ignored the max_price constraint. The fix is a system prompt tweak.

Iterate

Update the agent’s system prompt, redeploy, re-run the Sonnet experiment. Compare new Sonnet run to the previous one to confirm row 7 is fixed without breaking anything else.

That loop — run, compare, drill into trace, fix, re-run — is what agent experiments are designed to enable.

Next

Compare experiments

Side-by-side diffs and evaluator deltas across runs.

Run evals on experiments

Add evaluators to score agent outputs.

CI/CD with experiments

Trigger agent experiments from your deploy pipeline.

Experiment in code

Drive agent experiments from Python / TypeScript / CLI.
