# Running agent experiments

> Launch a dataset against a registered agent endpoint, override config per run, and compare runs in the agent playground.

Once you've [registered an agent](/ax/improve/setup-agent-endpoint) and [wired up tracing](/ax/improve/agent-tracing-context), running an experiment against it is a no-code workflow. This page covers the agent playground UI for kicking off, monitoring, and comparing runs.

## Prerequisites

* A dataset in your space ([Build a dataset](/ax/improve/build-a-dataset)).
* A registered agent configuration ([Setting up your agent endpoint](/ax/improve/setup-agent-endpoint)).
* *Optional but recommended:* tracing wired into your agent ([Setting up tracing](/ax/improve/agent-tracing-context)).

## Launch an experiment

<Steps>
  <Step title="Open the dataset">
    Navigate to **Datasets** in the left nav and click the dataset you want to run against.
  </Step>

  <Step title="Click New Experiment → Run against agent">
    The agent playground modal opens with the dataset already selected.
  </Step>

  <Step title="Pick an agent">
    Use the **Agent** dropdown to select one of your registered agent configurations. The endpoint URL and auth from the configuration are used automatically — you don't re-enter them per run.
  </Step>

  <Step title="Choose a preset (or write custom config)">
    If the agent has [request presets](/ax/improve/setup-agent-endpoint#optional-add-request-presets), pick one from the **Preset** dropdown. The preset's config JSON populates the editor.

    You can edit the JSON before running — useful when you want to tweak one parameter from a known-good baseline without saving a new preset.
  </Step>

  <Step title="Set the body template">
    The **Body Template** field is the JSON Arize sends as `input` in the request. Use `{{column_name}}` to interpolate from each dataset row:

    ```json theme={null}
    {
      "goal": "{{input}}",
      "config": { "model": "claude-sonnet-4-5", "max_turns": 12 }
    }
    ```
  </Step>

  <Step title="(Optional) Limit to a subset">
    For quick sanity checks, run on a subset of the dataset before committing to the full run. Useful for validating the request shape with one or two rows first.
  </Step>

  <Step title="(Optional) Add evaluators">
    Attach evaluators to score each run's output automatically. See [Evals overview](/ax/evaluate/evals-overview) for setup.
  </Step>

  <Step title="Click Run">
    Arize fans out the dataset against your endpoint in parallel, applies retries on transient failures, and streams results into a new experiment.
  </Step>
</Steps>
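
To make the `{{column_name}}` interpolation from the body template step concrete, here is a minimal Python sketch of how a template could be hydrated from one dataset row. It only illustrates the idea and is not Arize's implementation: the `hydrate` helper, the `category` column, and the choice to raise on an unknown column are all assumptions.

```python
import json
import re

# Matches {{column_name}} placeholders; whitespace inside the braces is tolerated.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def hydrate(template, row):
    """Recursively substitute {{column}} placeholders in a parsed JSON template."""
    if isinstance(template, dict):
        return {key: hydrate(value, row) for key, value in template.items()}
    if isinstance(template, list):
        return [hydrate(item, row) for item in template]
    if isinstance(template, str):
        # Assumption: an unknown column raises KeyError so a typo fails loudly.
        return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)
    return template  # numbers, booleans, and null pass through unchanged


body_template = json.loads("""
{
  "goal": "{{input}}",
  "config": { "model": "claude-sonnet-4-5", "max_turns": 12 }
}
""")

# Hypothetical dataset row; only the `input` column is referenced by the template.
row = {"input": "Weekend in NYC from Chicago", "category": "short-trip"}

hydrated = hydrate(body_template, row)
print(json.dumps(hydrated, indent=2))
```

Substituting into the parsed JSON rather than the raw string means quotes or newlines in a dataset value cannot break the request body.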

## Watch a run in progress

As the experiment runs, the experiments view streams rows in real time:

* **Status column** — pending, running, succeeded, failed.
* **Output column** — the agent's response body.
* **Latency / tokens** — populated from the spans your agent emitted (if traced).
* **Evaluator scores** — computed as each row completes.

If a row fails (timeout, HTTP error, agent exception), the error message appears in the run's detail panel along with the request body that was sent — useful for debugging and re-running just the failed subset.

## Inspecting an individual run

Click any row to see:

* **Input** — exact JSON body sent to your endpoint, including the hydrated `input` and the `arize_metadata` Arize appended.
* **Output** — full response body returned.
* **Trace** — if your agent is traced, the linked trace tree (CHAIN, LLM, TOOL spans).
* **Headers** — the request headers Arize sent, including `traceparent`, `Authorization`, and any custom headers you configured.
* **Evaluator scores** — per-evaluator pass/fail and reasoning.

The trace link is the highest-leverage debugging tool: when an agent run produces an unexpected output, opening the trace shows you which tool was called, what the LLM was prompted with at each turn, and where the chain diverged.
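
When a row's trace link is missing, one thing worth checking is whether your agent received and parsed the `traceparent` header shown in the **Headers** panel. The sketch below assumes the header follows the W3C Trace Context format (`version-traceid-parentid-flags`). The `parse_traceparent` helper is hypothetical, and the sample value is the example from the W3C specification, not an Arize request.

```python
import re

# W3C Trace Context: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header_value):
    """Split a traceparent header into its fields, or return None if malformed."""
    match = TRACEPARENT.match(header_value.strip().lower())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit means "sampled"
    }


# Example value taken from the W3C Trace Context specification.
parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

If the `trace_id` your agent logs differs from the one in the request headers, the spans were likely started without the incoming context and will not link back to the experiment row.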

## Comparing runs

Once you have two or more experiments on the same dataset, compare them to see what changed between runs.

### From the experiments tab

Open **Datasets → your dataset → Experiments**, multi-select the runs you want to compare, and click **Compare**. The comparison view shows:

* **Side-by-side outputs** for each dataset row across the selected runs.
* **Evaluator deltas** — which rows improved, regressed, or stayed flat.
* **Summary metrics** — pass rate, average latency, token counts per run.
* **Tool-call patterns** — if traced, you can see which runs called different tools or took different paths.
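
The evaluator deltas and summary metrics above boil down to simple per-row arithmetic. The following Python sketch shows one way to compute them from exported results. The data shapes (`row_id -> score` maps and rows with `passed` and `latency_s` keys) are assumptions chosen for illustration, not the format Arize uses.

```python
def evaluator_deltas(baseline, candidate):
    """Classify each row present in both runs as improved, regressed, or flat."""
    deltas = {}
    for row_id in baseline.keys() & candidate.keys():
        diff = candidate[row_id] - baseline[row_id]
        if diff > 0:
            deltas[row_id] = "improved"
        elif diff < 0:
            deltas[row_id] = "regressed"
        else:
            deltas[row_id] = "flat"
    return deltas


def summarize(rows):
    """Pass rate and mean latency for a single run."""
    total = len(rows)
    return {
        "pass_rate": sum(1 for r in rows if r["passed"]) / total,
        "avg_latency_s": sum(r["latency_s"] for r in rows) / total,
    }


# Hypothetical evaluator scores keyed by dataset row ID (1 = pass, 0 = fail).
baseline_scores = {"row-1": 1, "row-2": 0, "row-3": 1}
candidate_scores = {"row-1": 1, "row-2": 1, "row-3": 0}
deltas = evaluator_deltas(baseline_scores, candidate_scores)

summary = summarize([
    {"passed": True, "latency_s": 2.0},
    {"passed": False, "latency_s": 4.0},
])
print(deltas, summary)
```

Only rows present in both runs are compared, which keeps the deltas meaningful when one run covered just a subset of the dataset.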

### Common comparison patterns

| Question                                       | How to set it up                                                                |
| ---------------------------------------------- | ------------------------------------------------------------------------------- |
| Did the new model help?                        | Run two experiments, same dataset and preset, vary only `config.model`.         |
| Did the prompt change break anything?          | Run baseline before the deploy, then run again after. Compare evaluator deltas. |
| Which preset is best for prod?                 | Run each preset against the same dataset. Compare the summary metrics.          |
| Is this regression specific to one input type? | Filter the comparison view by a metadata column (e.g. `category`).              |
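
A comparison such as "vary only `config.model`" is only trustworthy if nothing else changed between the two runs. As a rough pre-flight check, the sketch below diffs two request configs and lists the paths that differ. The `flatten` and `changed_paths` helpers and the `candidate-model-id` value are hypothetical.

```python
import copy

_MISSING = object()  # marks a path that exists in only one of the two configs


def flatten(value, prefix=""):
    """Flatten nested dicts into {"a.b": leaf} so configs can be diffed key by key."""
    if isinstance(value, dict):
        flat = {}
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
        return flat
    return {prefix.rstrip("."): value}


def changed_paths(baseline, candidate):
    """Return the sorted dotted paths whose values differ between two configs."""
    a, b = flatten(baseline), flatten(candidate)
    return sorted(
        path for path in a.keys() | b.keys()
        if a.get(path, _MISSING) != b.get(path, _MISSING)
    )


baseline = {
    "goal": "{{input}}",
    "config": {"model": "claude-sonnet-4-5", "max_turns": 12},
}
candidate = copy.deepcopy(baseline)
candidate["config"]["model"] = "candidate-model-id"  # placeholder, not a real model ID

diff = changed_paths(baseline, candidate)
print(diff)
```

If the list contains anything beyond the parameter you meant to vary, the two experiments are not a clean A/B comparison.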

## Re-running failed rows

In the experiment detail view, filter for **status = failed**, select the failed rows, and click **Re-run**. Arize calls your endpoint only for those rows and merges the new results into the existing experiment, so you keep the successful rows while chasing a single flaky failure.
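
Conceptually, the merge replaces the failed rows' entries and leaves every other row untouched. Here is a minimal Python sketch of that behavior, assuming results are keyed by a row ID. The `failed_row_ids` and `merge_rerun` helpers and the result shape are illustrative, not Arize's data model.

```python
def failed_row_ids(results):
    """Row IDs whose last attempt failed, i.e. the subset to re-run."""
    return sorted(row_id for row_id, r in results.items() if r["status"] == "failed")


def merge_rerun(existing, rerun):
    """Return a new result map where re-run rows replace their earlier entries."""
    unknown = rerun.keys() - existing.keys()
    if unknown:
        raise ValueError(f"re-run contains rows not in the experiment: {sorted(unknown)}")
    merged = dict(existing)  # copy so the original results stay untouched
    merged.update(rerun)
    return merged


# Hypothetical experiment results keyed by dataset row ID.
existing = {
    "row-1": {"status": "succeeded", "output": "3-day Tokyo itinerary"},
    "row-2": {"status": "failed", "error": "timeout"},
    "row-3": {"status": "succeeded", "output": "NYC weekend plan"},
}

to_rerun = failed_row_ids(existing)
rerun = {"row-2": {"status": "succeeded", "output": "Lisbon plan"}}
merged = merge_rerun(existing, rerun)
print(to_rerun, failed_row_ids(merged))
```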

## Concurrency, timeouts, and retries

When you launch an experiment, you can tune:

* **Concurrency** — how many parallel POST requests Arize sends. Default is 10. Lower it if your agent's downstream API has a rate limit.
* **Timeout** — per-request, in seconds. Default is 120s. Raise it for agents with long-running loops (e.g. multi-step research agents).
* **Retries** — for transient failures (5xx, network errors), Arize retries with exponential backoff. Defaults are conservative; raise the retry count for agents you know are flaky.

These settings live in the launch modal under **Advanced options**.
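
To build intuition for how these three settings interact, the sketch below models a fan-out with bounded concurrency, a per-request timeout, and exponential backoff, using only the Python standard library. It is not Arize's runner. The `TransientError` class, the `fake_send` endpoint, the retry count, the backoff delays, and the decision to treat a timeout as a row failure rather than retrying it are all assumptions. Only the defaults of 10 concurrent requests and a 120-second timeout come from this page.

```python
import asyncio
import random


class TransientError(Exception):
    """Stand-in for a 5xx response or a network error."""


async def call_with_retries(send, row, *, timeout_s=120.0, max_retries=3, base_delay_s=0.5):
    """Call send(row) with a per-request timeout; back off and retry on transient errors."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(send(row), timeout=timeout_s)
        except TransientError:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt (base, 2x, 4x, ...) plus a little jitter.
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            await asyncio.sleep(delay)


async def run_all(send, rows, *, concurrency=10, **retry_kwargs):
    """Fan out rows with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(row):
        async with semaphore:
            try:
                output = await call_with_retries(send, row, **retry_kwargs)
                return {"row": row, "status": "succeeded", "output": output}
            except Exception as exc:  # timeouts and agent errors mark the row failed
                return {"row": row, "status": "failed", "error": repr(exc)}

    return await asyncio.gather(*(worker(row) for row in rows))


attempts = {}


async def fake_send(row):
    """Hypothetical endpoint: 'flaky' fails twice with a 503, 'broken' always errors."""
    attempts[row] = attempts.get(row, 0) + 1
    if row == "flaky" and attempts[row] < 3:
        raise TransientError("503")
    if row == "broken":
        raise ValueError("agent exception")
    return f"plan for {row}"


results = asyncio.run(
    run_all(fake_send, ["ok", "flaky", "broken"], concurrency=2, base_delay_s=0.01)
)
print([r["status"] for r in results])
```

Lowering `concurrency` reduces pressure on a rate-limited downstream API but lengthens the total run, while each retry can add up to the backoff delay plus another full timeout to a single row.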

## Running from code or CLI

The agent playground is the UI path. If you'd rather drive it from code (e.g. as part of a CI pipeline), use:

* **Python SDK** — `AgentEndpointTask` in `arize.experiments`. See [Experiment in code](/ax/improve/experiment-in-code).
* **`ax` CLI** — `ax agent-replay run --dataset-name <ds> --agent-config-name <agent>` for headless runs.
* **REST API** — for orchestration from any language.

All three paths produce the same experiment artifacts as the UI path, so you can mix and match (kick off via CI, debug in the UI).

## End-to-end example

This walkthrough uses the travel-agent demo: a registered agent, a dataset of 20 travel goals, and three presets.

<Steps>
  <Step title="Pick the dataset">
    Open `travel-goals-v1` (20 goals like *"Plan a 3-day trip to Tokyo from SF in October"*, *"Weekend in NYC from Chicago"*).
  </Step>

  <Step title="Launch with the baseline preset">
    **New Experiment → Run against agent → travel-agent → Production baseline (Sonnet 4.5) → Run**. Wait \~3 minutes for 20 rows to complete.
  </Step>

  <Step title="Launch a second experiment with Opus">
    Same dataset, same agent, **Opus 4.7** preset. Run. \~5 minutes (Opus is slower).
  </Step>

  <Step title="Compare">
    **Compare Experiments** → see per-row output diffs. Opus produced richer itineraries on 16/20 rows, but average latency was 2.3× higher. Pass rate on the "produces a coherent multi-day plan" evaluator: Sonnet 18/20, Opus 20/20.
  </Step>

  <Step title="Inspect a regression">
    On row 7 (Lisbon), Sonnet picked a \$\$\$\$ hotel; Opus picked a \$\$ one. Open the Sonnet trace and look at the `search_hotels` TOOL span: it ranked results by rating and ignored the `max_price` constraint. The fix is a system prompt tweak.
  </Step>

  <Step title="Iterate">
    Update the agent's system prompt, redeploy, re-run the Sonnet experiment. Compare new Sonnet run to the previous one to confirm row 7 is fixed without breaking anything else.
  </Step>
</Steps>

That loop — run, compare, drill into trace, fix, re-run — is what agent experiments are designed to enable.

## Next

<CardGroup cols={2}>
  <Card title="Compare experiments" href="/ax/improve/experiment-in-playground#compare-experiments">
    Side-by-side diffs and evaluator deltas across runs.
  </Card>

  <Card title="Run evals on experiments" href="/ax/evaluate/run-evals-on-experiments">
    Add evaluators to score agent outputs.
  </Card>

  <Card title="CI/CD with experiments" href="/ax/improve/ci-cd-for-automated-experiments">
    Trigger agent experiments from your deploy pipeline.
  </Card>

  <Card title="Experiment in code" href="/ax/improve/experiment-in-code">
    Drive agent experiments from Python / TypeScript / CLI.
  </Card>
</CardGroup>
