Agent experiments let you test a deployed agent end-to-end — routing, tool selection, multi-step orchestration — by hitting your own HTTP endpoint with every row in a dataset and collecting the results as a standard experiment in Arize. Unlike Experiment in playground, which tests a single prompt in isolation, agent experiments exercise the entire agent flow. Your agent runs in your infrastructure; Arize orchestrates the dataset run, captures responses, links traces, and stores everything as comparable experiment runs.

When to use this

Use agent experiments when you want to answer questions like:
  • Does changing a router prompt fix tool selection across the dataset?
  • How does a model swap on one expert node affect the full supervisor agent’s outputs?
  • Did a new system prompt break downstream tool calls?
  • How do different parameter combinations compare on the same realistic inputs?
If the change you want to test fits inside a single prompt, Experiment in playground is faster. If the change spans multiple LLM calls, retrieval, tool execution, or routing, agent experiments are the right tool.

How it works

[Figure: Agent experiment flow diagram]
1. You deploy your agent behind an HTTP endpoint
   Any framework works: LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, or custom code. The only requirement is a POST endpoint that accepts JSON and returns JSON.

2. You register the endpoint in Arize
   In Space Settings → Agents, you add an Agent Configuration: the endpoint URL, auth headers, and a JSON Schema for the request body. See Setting up your agent endpoint.

3. You run an experiment against a dataset
   From the dataset page, pick New Experiment → Run against agent, choose the agent configuration, optionally override the config payload, and click Run.

4. Arize POSTs each row to your endpoint
   The coordinator hydrates your request template with each dataset row and POSTs in parallel (with retries, timeouts, and rate limiting). Every row produces one experiment run.

5. Traces link back automatically
   If your agent is instrumented with Arize tracing, Arize propagates a W3C traceparent header so every span your agent emits becomes a child of the experiment-run trace. See Setting up tracing for agent experiments.
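
Steps 1 and 5 can be sketched together using only Python's standard library. This is a minimal illustration, not Arize's reference implementation: the `input` and `output` field names, the `run_agent` helper, and the `serve` function are hypothetical. A real agent would hand the parsed trace context to its tracing SDK instead of echoing the trace ID in the response. The `traceparent` layout (version, trace ID, parent span ID, flags) follows the W3C Trace Context specification.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace context as a dict, or None if missing or malformed."""
    if not header:
        return None
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def run_agent(payload):
    """Hypothetical stand-in for the real agent: echoes the input back."""
    return {"output": f"echo: {payload.get('input', '')}"}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self._send(400, {"error": "request body must be a JSON object"})
            return
        result = run_agent(payload)
        # Echoed for illustration only; a traced agent would instead start
        # its spans as children of this context.
        context = parse_traceparent(self.headers.get("traceparent"))
        result["trace_id"] = context["trace_id"] if context else None
        self._send(200, result)

    def _send(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), AgentHandler).serve_forever()
```

Calling `serve()` starts the endpoint on localhost. Parsing the header defensively means a request without trace context still gets a normal response; it just won't be linked to a parent trace.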

What you get

Every row of the dataset turns into one experiment run with:
  • The full request body sent to your agent
  • The full response body returned
  • Any traces your agent emitted, nested under the experiment-run trace
  • Failure details (HTTP error, timeout) for runs that didn’t complete
  • Evaluator scores, if you attach evaluators
You can then compare experiments the same way you compare prompt-level runs.
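
To make the shape of a run concrete, the sketch below mimics what an orchestrator does for each row: fill a request template, call the endpoint within a retry budget, and record the request, the response, and any failure. It is an illustration under stated assumptions, not Arize's coordinator. The `{{column}}` placeholder syntax, the `RunRecord` fields, and the `call_endpoint` callable are all hypothetical, and rate limiting is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RunRecord:
    """Hypothetical per-row result: one record per dataset row."""
    request: dict
    response: Optional[dict] = None
    error: Optional[str] = None
    attempts: int = 0


def hydrate(template: Any, row: dict) -> Any:
    """Recursively replace "{{column}}" strings with the row's value."""
    if isinstance(template, dict):
        return {key: hydrate(value, row) for key, value in template.items()}
    if isinstance(template, list):
        return [hydrate(item, row) for item in template]
    if (
        isinstance(template, str)
        and template.startswith("{{")
        and template.endswith("}}")
    ):
        return row[template[2:-2].strip()]
    return template


def run_row(
    template: dict,
    row: dict,
    call_endpoint: Callable[[dict], dict],
    max_attempts: int = 3,
) -> RunRecord:
    """Call the endpoint for one row, retrying on any exception."""
    record = RunRecord(request=hydrate(template, row))
    for attempt in range(1, max_attempts + 1):
        record.attempts = attempt
        try:
            record.response = call_endpoint(record.request)
            record.error = None
            return record
        except Exception as exc:  # e.g. an HTTP error or a timeout
            record.error = f"{type(exc).__name__}: {exc}"
    return record  # all attempts failed; the error field explains why


def run_experiment(template, rows, call_endpoint, max_workers=4):
    """Process rows in parallel; results keep the dataset's row order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda row: run_row(template, row, call_endpoint), rows)
        )
```

Passing a fake `call_endpoint` that raises for some inputs is a quick way to see the two outcomes side by side: completed runs carry a response, and failed runs carry an error string with the number of attempts made.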

What you don’t need to do

  • You don’t write task code in Arize. Your agent already exists; we just call it.
  • You don’t move your model or data. Arize never sees your agent’s internals, only the responses your endpoint returns and any traces your agent emits.
  • You don’t need to be an engineer to run one. Once an engineer registers the agent configuration, anyone in the space can kick off an experiment from the UI.

Compared to other workflows

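|                               | Playground experiment | Code experiment     | Agent experiment          |
|-------------------------------|-----------------------|---------------------|---------------------------|
| Tests                         | A single prompt       | A Python function   | A deployed agent endpoint |
| Where it runs                 | Arize-hosted          | Your Python runtime | Your hosted infra         |
| Who can run it                | Anyone in the space   | Engineers           | Anyone in the space       |
| Multi-step, tool use, routing | No                    | Yes                 | Yes                       |
| Code change required          | No                    | Yes                 | No                        |
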
Agent experiments combine the no-code launch of Playground experiments with the multi-step realism of code experiments.

Next steps

  • Set up your agent endpoint: Register your deployed agent with Arize.
  • Set up tracing: Link agent traces to experiment runs via trace context propagation.
  • Run an experiment: Pick a dataset, run, and compare.