# Agent experiments overview

> Test deployed agents end-to-end by running a dataset against a customer-hosted agent endpoint and collecting the results as an experiment.

Agent experiments let you test a deployed agent end-to-end, including routing, tool selection, and multi-step orchestration, by calling your own HTTP endpoint with every row in a dataset and collecting the results as a standard experiment in Arize.

Unlike [Experiment in playground](/ax/improve/experiment-in-playground), which tests a single prompt in isolation, agent experiments exercise the **entire agent flow**. Your agent runs in your infrastructure; Arize orchestrates the dataset run, captures responses, links traces, and stores everything as comparable experiment runs.

## When to use this

Use agent experiments when you want to answer questions like:

* Does changing a router prompt fix tool selection across the dataset?
* How does a model swap on one expert node affect the full supervisor agent's outputs?
* Did a new system prompt break downstream tool calls?
* How do different parameter combinations compare on the same realistic inputs?

If the change you want to test fits inside a single prompt, [Experiment in playground](/ax/improve/experiment-in-playground) is faster. If the change spans multiple LLM calls, retrieval, tool execution, or routing, agent experiments are the right tool.

## How it works

*Figure: agent experiment flow diagram.*

1. Expose your agent over HTTP. Any framework works (LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, or custom code); the only requirement is a `POST` endpoint that accepts JSON and returns JSON.
2. From the left navigation, open **Agent Endpoints** and add a new endpoint, providing the URL, auth headers, and a JSON Schema for the request body. See [Setting up your agent endpoint](/ax/improve/setup-agent-endpoint).
3. From the dataset page, pick **New Experiment → Run in Agent Playground**, choose the agent configuration, optionally override the config payload, and click **Run**.
4. The coordinator hydrates your request template with each dataset row and sends the `POST` requests in parallel, with retries, timeouts, and rate limiting. Every row produces one experiment run.
5. If your agent is instrumented with Arize tracing, Arize propagates a W3C `traceparent` header so every span your agent emits becomes a child of the experiment-run trace. See [Setting up tracing for agent experiments](/ax/improve/agent-tracing-context).

## Example: travel-agent run

Suppose your dataset has a row like:

```json theme={null}
{
  "input": "Plan a 3-day trip to Tokyo from SF in October",
  "budget": "mid-range"
}
```

In the Agent Playground, set the body template to:

```json theme={null}
{
  "goal": "{{dataset.input}}",
  "config": {
    "budget": "{{dataset.budget}}",
    "model": "claude-sonnet-4-6"
  }
}
```

For that row, Arize sends your endpoint:

```json theme={null}
{
  "input": {
    "goal": "Plan a 3-day trip to Tokyo from SF in October",
    "config": {
      "budget": "mid-range",
      "model": "claude-sonnet-4-6"
    }
  },
  "arize_metadata": {
    "dataset_id": "abc...",
    "experiment_id": "exp...",
    "run_id": "run...",
    "example_id": "ex...",
    "space_id": "sp..."
  }
}
```

The hydrated template arrives under `input`, alongside an `arize_metadata` object that identifies the dataset, experiment, run, example, and space.

Your agent returns JSON, such as `{ "final_response": "...", "tool_calls": [...] }`, and Arize stores that response as the experiment output for the dataset row. If tracing is configured, the LLM calls and tool calls from that run link back to the same experiment row.
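To make the endpoint contract concrete, here is a minimal sketch of an agent endpoint built only on Python's standard-library `http.server`. It is not an Arize SDK and not a production server. It accepts the request shape from the travel-agent example, reads the optional W3C `traceparent` header, and returns a JSON object with `final_response` and `tool_calls`. The names `run_agent`, `AgentHandler`, and `serve`, the default port 8000, and the 400 response for non-object bodies are hypothetical choices for illustration; your real agent and web framework replace them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def run_agent(goal: str, config: dict, traceparent: Optional[str]) -> dict:
    """Hypothetical stand-in for your real agent (LangGraph, CrewAI, custom code, ...).

    A real implementation would start its root span inside the trace context
    parsed from `traceparent` (for example via OpenTelemetry's
    `opentelemetry.propagate.extract`) so its spans nest under the experiment run.
    """
    budget = config.get("budget", "unspecified")
    return {
        "final_response": f"Draft itinerary for: {goal} (budget: {budget})",
        "tool_calls": [],  # record the tools your agent invoked, if any
    }


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the raw body; Arize sends {"input": {...}, "arize_metadata": {...}}.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            body = json.loads(raw or b"{}")
        except ValueError:  # covers JSONDecodeError and invalid UTF-8
            body = None
        if not isinstance(body, dict):
            self._send_json(400, {"error": "request body must be a JSON object"})
            return

        # "input" holds the hydrated body template. body["arize_metadata"] carries
        # dataset_id, experiment_id, run_id, example_id, and space_id if you want to log them.
        payload = body.get("input", {})
        # W3C trace context header; None when no tracing context was sent.
        traceparent = self.headers.get("traceparent")

        result = run_agent(payload.get("goal", ""), payload.get("config", {}), traceparent)
        self._send_json(200, result)

    def _send_json(self, status: int, data: dict) -> None:
        encoded = json.dumps(data).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

    def log_message(self, *args) -> None:
        pass  # silence the default per-request stderr logging


def serve(port: int = 8000) -> None:
    """Serve the sketch on all interfaces until interrupted."""
    HTTPServer(("0.0.0.0", port), AgentHandler).serve_forever()
```

Calling `serve()` starts the sketch on port 8000; its publicly reachable URL is what you would register under **Agent Endpoints**. Any web framework you already use works equally well here; the standard library just keeps the sketch dependency-free.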
## What you get

Every row of the dataset turns into one experiment run with:

* The full request body sent to your agent
* The full response body returned
* Any traces your agent emitted, nested under the experiment-run trace
* Failure details (HTTP error, timeout) for runs that didn't complete
* Evaluator scores, if you attach evaluators

You can then [compare experiments](/ax/improve/experiment-in-playground#compare-experiments) the same way you compare prompt-level runs.

## What you don't need to do

* **You don't write task code in Arize.** Your agent already exists; we just call it.
* **You don't move your model or data.** Arize never sees your agent's internals, only the responses your endpoint returns and any traces you choose to emit.
* **You don't need to be an engineer to run one.** Once an engineer registers the agent configuration, anyone in the space can kick off an experiment from the UI.

## Compared to other workflows

|                               | Playground experiment | Code experiment     | **Agent experiment**      |
| ----------------------------- | --------------------- | ------------------- | ------------------------- |
| Tests                         | A single prompt       | A Python function   | A deployed agent endpoint |
| Where it runs                 | Arize-hosted          | Your Python runtime | Your hosted infra         |
| Who can run it                | Anyone in the space   | Engineers           | Anyone in the space       |
| Multi-step, tool use, routing | No                    | Yes                 | Yes                       |
| Code change required          | No                    | Yes                 | No                        |

Agent experiments combine the no-code launch of Playground experiments with the multi-step realism of code experiments.

## Next steps

* [Setting up your agent endpoint](/ax/improve/setup-agent-endpoint): register your deployed agent with Arize.
* [Setting up tracing for agent experiments](/ax/improve/agent-tracing-context): link agent traces to experiment runs via trace context propagation.
* Pick a dataset, run, and compare.