Agent experiments overview

Agent experiments let you test a deployed agent end-to-end — routing, tool selection, multi-step orchestration — by hitting your own HTTP endpoint with every row in a dataset and collecting the results as a standard experiment in Arize. Unlike Experiment in playground, which tests a single prompt in isolation, agent experiments exercise the entire agent flow. Your agent runs in your infrastructure; Arize orchestrates the dataset run, captures responses, links traces, and stores everything as comparable experiment runs.

When to use this

Use agent experiments when you want to answer questions like:

Does changing a router prompt fix tool selection across the dataset?
How does a model swap on one expert node affect the full supervisor agent’s outputs?
Did a new system prompt break downstream tool calls?
How do different parameter combinations compare on the same realistic inputs?

If the change you want to test fits inside a single prompt, Experiment in playground is faster. If the change spans multiple LLM calls, retrieval, tool execution, or routing, agent experiments are the right tool.

How it works

Figure: Agent experiment flow: dataset → Arize coordinator → your agent → experiment runs + traces

You deploy your agent behind an HTTP endpoint

Any framework — LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, or custom code — works. The only requirement is a POST endpoint that accepts JSON and returns JSON.
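As a rough illustration of that requirement, here is a minimal sketch of a JSON-in, JSON-out POST endpoint using only the Python standard library. The names run_agent, AgentHandler, and make_server are hypothetical, the stub does not call any real agent, and the request and response shapes are borrowed from the travel-agent example on this page rather than from a documented contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(payload: dict) -> dict:
    """Hypothetical stand-in for a real agent; replace with your own logic."""
    # Assumption: the goal arrives under payload["input"]["goal"],
    # as in the travel-agent example on this page.
    agent_input = payload.get("input")
    goal = agent_input.get("goal", "") if isinstance(agent_input, dict) else ""
    return {"final_response": f"(stub) received goal: {goal}", "tool_calls": []}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly Content-Length bytes and decode them as JSON.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            payload = json.loads(raw or b"{}")
        except json.JSONDecodeError:
            self._send(400, {"error": "request body is not valid JSON"})
            return
        if not isinstance(payload, dict):
            self._send(400, {"error": "request body must be a JSON object"})
            return
        self._send(200, run_agent(payload))

    def _send(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server without starting it; call serve_forever() to run."""
    return HTTPServer(("127.0.0.1", port), AgentHandler)
```

Calling make_server().serve_forever() would start it locally. A real deployment would likely sit behind a production web framework and check the auth headers registered with the endpoint.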

You register the endpoint in Arize

From the left navigation, open Agent Endpoints and add a new endpoint: the URL, auth headers, and a JSON Schema for the request body. See Setting up your agent endpoint.
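To make the JSON Schema part concrete, here is a hedged sketch of what a schema for the travel-agent request body might look like, written as a Python dict alongside a deliberately tiny checker. The field names come from the example on this page; which fields are marked required is an assumption, and the check function is a hypothetical helper that covers only the type, required, and properties keywords, not the full JSON Schema specification.

```python
# Assumed schema for the travel-agent body: a "goal" string plus a "config" object.
REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "goal": {"type": "string"},
        "config": {
            "type": "object",
            "properties": {
                "budget": {"type": "string"},
                "model": {"type": "string"},
            },
            "required": ["budget", "model"],  # assumption, not from the docs
        },
    },
    "required": ["goal", "config"],  # assumption, not from the docs
}

_PY_TYPES = {"object": dict, "string": str}


def check(value, schema, path="$"):
    """Return a list of problems; supports only 'type', 'required', 'properties'."""
    expected = _PY_TYPES.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(check(value[key], sub_schema, f"{path}.{key}"))
    return problems
```

Running check against a candidate body before registering the schema is one way to catch a mismatch between the template and the schema early. A full validator library would be the better choice for anything beyond a quick sanity check.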

You run an experiment against a dataset

From the dataset page, pick New Experiment → Run in Agent Playground, choose the agent configuration, optionally override the config payload, and click Run.

Arize POSTs each row to your endpoint

The coordinator hydrates your request template with each dataset row and POSTs in parallel (with retries, timeouts, and rate limiting). Every row produces one experiment run.
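The coordinator's internals are not described here, so the following is only a sketch of the general pattern: fan requests out in parallel, retry failures a bounded number of times, and record one result per row. The names call_with_retries and run_rows are hypothetical, send is any callable you supply that performs the POST (and enforces its own timeout), and the worker-pool size stands in crudely for rate limiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(send, payload, max_attempts=3, backoff_s=0.5):
    """Call send(payload), retrying on any exception up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "output": send(payload), "attempts": attempt}
        except Exception as exc:  # e.g. an HTTP error or a timeout raised by send
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    return {"status": "failed", "error": str(last_error), "attempts": max_attempts}


def run_rows(send, payloads, max_workers=4, **retry_kwargs):
    """Process every payload in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda p: call_with_retries(send, p, **retry_kwargs), payloads)
        )
```

Each returned dict corresponds to one row, mirroring the one-run-per-row behavior described above, and a failed row carries its error message instead of an output.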

Traces link back automatically

If your agent is instrumented with Arize tracing, Arize propagates a W3C traceparent header so every span your agent emits becomes a child of the experiment-run trace. See Setting up tracing for agent experiments.
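For reference, a W3C traceparent header in its version-00 layout has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags. The sketch below parses that layout by hand to show what your agent receives. The function name parse_traceparent is hypothetical, and in practice a tracing library's context propagator would normally do this for you.

```python
import re

# version-traceid-parentid-flags, all lowercase hex (version-00 layout).
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }
```

Spans that your agent creates with this trace_id, and with parent_id as their parent, are what allows them to nest under the experiment-run trace.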

Example: travel-agent run

Suppose your dataset has a row like:

{
  "input": "Plan a 3-day trip to Tokyo from SF in October",
  "budget": "mid-range"
}

In the Agent Playground, set the body template to:

{
  "goal": "{{dataset.input}}",
  "config": {
    "budget": "{{dataset.budget}}",
    "model": "claude-sonnet-4-6"
  }
}

For that row, Arize sends your endpoint:

{
  "input": {
    "goal": "Plan a 3-day trip to Tokyo from SF in October",
    "config": {
      "budget": "mid-range",
      "model": "claude-sonnet-4-6"
    }
  },
  "arize_metadata": {
    "dataset_id": "abc...",
    "experiment_id": "exp...",
    "run_id": "run...",
    "example_id": "ex...",
    "space_id": "sp..."
  }
}

Your agent returns JSON, such as { "final_response": "...", "tool_calls": [...] }, and Arize stores that response as the experiment output for the dataset row. If tracing is configured, the LLM calls and tool calls from that run link back to the same experiment row.
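The templating step in this example can be sketched as follows. This is an illustrative approximation, not Arize's implementation: hydrate and build_request are hypothetical names, the placeholder syntax is inferred from the {{dataset.input}} form shown above, and every substituted value is converted to a string.

```python
import re

# Matches placeholders such as {{dataset.input}} (syntax inferred from the example).
_PLACEHOLDER = re.compile(r"\{\{\s*dataset\.(\w+)\s*\}\}")


def hydrate(template, row: dict):
    """Recursively replace {{dataset.<field>}} placeholders with row values."""
    if isinstance(template, dict):
        return {key: hydrate(value, row) for key, value in template.items()}
    if isinstance(template, list):
        return [hydrate(value, row) for value in template]
    if isinstance(template, str):
        # Raises KeyError if the row lacks a referenced field.
        return _PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)
    return template  # numbers, booleans, and None pass through unchanged


def build_request(template, row: dict, metadata: dict) -> dict:
    """Wrap the hydrated body in the envelope shown in the example above."""
    return {"input": hydrate(template, row), "arize_metadata": metadata}
```

Applied to the row and template from this example, hydrate yields the same "input" object shown in the request that Arize sends.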

What you get

Every row of the dataset turns into one experiment run with:

The full request body sent to your agent
The full response body returned
Any traces your agent emitted, nested under the experiment-run trace
Failure details (HTTP error, timeout) for runs that didn’t complete
Evaluator scores, if you attach evaluators

You can then compare experiments the same way you compare prompt-level runs.

What you don’t need to do

You don’t write task code in Arize. Your agent already exists; Arize just calls it.
You don’t move your model or data. Arize never sees your agent’s internals — only the responses your endpoint returns.
You don’t need to be an engineer to run one. Once an engineer registers the agent configuration, anyone in the space can kick off an experiment from the UI.

Compared to other workflows

Aspect	Playground experiment	Code experiment	Agent experiment
Tests	A single prompt	A Python function	A deployed agent endpoint
Where it runs	Arize-hosted	Your Python runtime	Your hosted infrastructure
Who can run it	Anyone in the space	Engineers	Anyone in the space
Multi-step, tool use, routing	No	Yes	Yes
Code change required	No	Yes	No

Agent experiments combine the no-code launch of Playground experiments with the multi-step realism of code experiments.

Next steps

Set up your agent endpoint

Set up tracing

Run an experiment