# Set up an experiment

> Turn a finished dataset into an experiment plan: isolate the variable, define a baseline, and choose where to run it

Use experiments to turn a prompt, model, retrieval, pipeline, or agent change into a controlled comparison. Start from a dataset, define the task and baseline, and choose the evaluator signals you want to compare so every run answers the same question.

## What an experiment includes

Experiments typically combine four pieces. A **dataset** is the fixed benchmark you rerun against. A **task** is the prompt, model, pipeline, or agent behavior you want to test. **Evaluators** turn each output into a signal you can compare. A **baseline** gives you the unchanged version to measure against. That fixed setup is what makes experiments more reliable than one-off spot checks on traces or ad hoc examples.

<Frame caption="How datasets, tasks, and evaluators combine to score an experiment">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/improve/set_up_hero.avif" alt="Experiment workflow diagram: a dataset of examples feeds inputs into Run Tasks (app template, eval template, model change, retrieval strategy), whose outputs are sent to an evaluator alongside optional reference outputs from the dataset, producing a score" />
</Frame>
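To make the four pieces concrete, here is a minimal sketch in plain Python. It does not use the Arize AX SDK: `Example`, `exact_match`, `run_experiment`, and the toy tasks are hypothetical names invented for illustration, and the "tasks" are lookup tables standing in for real prompt or model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical in-memory stand-ins for illustration; not the Arize AX SDK.


@dataclass(frozen=True)
class Example:
    input: str
    expected: str


Task = Callable[[str], str]
Evaluator = Callable[[str, str], float]

# Dataset: the fixed benchmark every run is scored against.
DATASET = [
    Example("2 + 2", "4"),
    Example("capital of France", "Paris"),
    Example("opposite of hot", "cold"),
]


def baseline_task(text: str) -> str:
    """Baseline: the unchanged behavior you measure against."""
    return {"2 + 2": "4"}.get(text, "unknown")


def variant_task(text: str) -> str:
    """Task under test: the same behavior with one change applied."""
    return {"2 + 2": "4", "capital of France": "paris"}.get(text, "unknown")


def exact_match(output: str, expected: str) -> float:
    """Evaluator: 1.0 for a case-insensitive match with the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(dataset: list, task: Task, evaluator: Evaluator) -> float:
    """Run the task on every example and return the mean evaluator score."""
    return mean(evaluator(task(ex.input), ex.expected) for ex in dataset)


baseline_score = run_experiment(DATASET, baseline_task, exact_match)
variant_score = run_experiment(DATASET, variant_task, exact_match)
print(f"baseline={baseline_score:.2f} variant={variant_score:.2f}")
```

Because the dataset and evaluator are identical across both runs, the only explanation for a score difference is the change in the task.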

## How to decide what to test

The strongest experiments start with a real failure, not a hunch. Before you write a variant, walk the trace back to the step that broke.

If you do not have traces in Arize AX yet, start by [setting up tracing](/ax/instrument/set-up-tracing) so you can find real failures to turn into experiments.

1. Open a bad outcome in [Traces](/ax/observe/tracing/view-and-manage-traces): a thumbs-down, a low eval score, an exception, or anything a reviewer flagged.
2. Step through the trace span by span. Was the prompt ambiguous? Did the model ignore an instruction? Did the retriever return the wrong chunk? Did a tool call fail to parse?
3. Identify the step that most directly caused the bad outcome. That step is your experimental variable.

Hold everything else constant and test one thing at a time. If you change the prompt, model, and retriever all at once, you won't know which change moved the score. Split those into separate variants, run each against the same dataset with the same evaluators, and compare.

Common things to test:

* **Prompt change:** wording, few-shot examples, output schema, or tool/function definitions.
* **Model change:** same prompt, different LLM.
* **Invocation parameters:** temperature, top-p, max tokens, and similar.
* **Pipeline, agent, or custom logic:** retrieval strategy, multi-step reasoning, tool-using loops, or any non-prompt code.
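One way to enforce the one-variable rule is to describe each run as a flat configuration and check that a variant differs from the baseline in exactly one field. The sketch below is plain Python under that assumption. The field names, the `model-a` placeholder, and the `single_changed_field` helper are hypothetical and not part of Arize AX.

```python
# Hypothetical run configurations; field names and values are illustrative only.
BASELINE = {
    "prompt": "Answer the customer's question.",
    "model": "model-a",
    "temperature": 0.2,
    "retriever_top_k": 10,
}

# Changes one thing: the model.
VARIANT_MODEL = {**BASELINE, "model": "gpt-4o"}

# Changes two things at once, so a score change cannot be attributed.
VARIANT_MIXED = {**BASELINE, "model": "gpt-4o", "temperature": 0.7}


def changed_fields(baseline: dict, variant: dict) -> list[str]:
    """Return the sorted names of every field whose value differs."""
    keys = baseline.keys() | variant.keys()
    return sorted(k for k in keys if baseline.get(k) != variant.get(k))


def single_changed_field(baseline: dict, variant: dict) -> str:
    """Return the one field under test, or raise if the variant changes 0 or 2+ fields."""
    diff = changed_fields(baseline, variant)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed field, got {diff}")
    return diff[0]


print(single_changed_field(BASELINE, VARIANT_MODEL))

try:
    single_changed_field(BASELINE, VARIANT_MIXED)
except ValueError as err:
    print(f"rejected: {err}")
```

A check like this can run before any experiment is launched, so a variant that quietly changes two settings is caught before it produces a score you cannot interpret.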

See [Create evaluators](/ax/evaluate/create-evaluators) if you need to refine what you're measuring before running the comparison.

## Plan a baseline

Your first run is the unchanged setup: the same prompt, model, and parameters you use today. Give it a descriptive name such as `baseline-original-prompt`, and name each variant after the variable under test (`variant-concise-prompt`, `variant-gpt-4o`, `variant-rerank-top-5`) so the comparison view stays readable.

If your task predicts labels, pick the ground-truth column, the prediction column, and the positive-class label now. You'll need them when [comparing runs in Playground](/ax/improve/experiment-in-playground#classification-metrics).
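The sketch below shows why those three choices matter: standard precision, recall, and F1 all depend on which column holds the truth, which holds the prediction, and which label counts as positive. It uses the textbook definitions in plain Python. The column names, the `escalate` and `resolve` labels, and the sample rows are made up for illustration, and the metrics Playground actually reports may differ.

```python
def classification_metrics(
    rows: list[dict], truth_col: str, pred_col: str, positive_label: str
) -> dict:
    """Compute precision, recall, F1, and accuracy for one positive class."""
    tp = fp = fn = tn = 0
    for row in rows:
        actual = row[truth_col] == positive_label
        predicted = row[pred_col] == positive_label
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(rows) if rows else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative rows only; the column names are assumptions.
ROWS = [
    {"expected_label": "escalate", "predicted_label": "escalate"},
    {"expected_label": "escalate", "predicted_label": "resolve"},
    {"expected_label": "resolve", "predicted_label": "escalate"},
    {"expected_label": "resolve", "predicted_label": "resolve"},
    {"expected_label": "escalate", "predicted_label": "escalate"},
]

metrics = classification_metrics(
    ROWS,
    truth_col="expected_label",
    pred_col="predicted_label",
    positive_label="escalate",
)
print(metrics)
```

Swapping the positive label to `resolve` changes every number except accuracy, which is why it helps to fix that choice before comparing the baseline with any variant.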

## Choose your path

Once you've identified the failing step, ask where the change actually lives. Use the [Playground](/ax/improve/experiment-in-playground) when the change stays inside a single prompt call. That includes prompt wording, model choice, invocation parameters, and tool-call behavior. You keep the rest of the system fixed and compare the variant directly against your baseline.

Use [code](/ax/improve/experiment-in-code) when the experiment spans more than one prompt or tool call, or when the run happens outside the Playground. That covers harness swaps, multi-step pipelines, agent or subagent architecture, custom sandboxes, and remote runs you want to log from Python, TypeScript, the CLI, or another service. If you start in the Playground and realize the change needs any of that, move it to code.

<CardGroup cols={2}>
  <Card title="Experiment in Playground" icon="window" href="/ax/improve/experiment-in-playground">
    **Single prompt call.** Prompt, model, parameter, or tool-call changes. Update the variant, run it against your dataset, compare results.
  </Card>

  <Card title="Experiment in code" icon="code" href="/ax/improve/experiment-in-code">
    **Complex or remote.** Pipelines, agents, sandboxes, runtime credentials, or runs you want to log from code.
  </Card>
</CardGroup>

## Further reading

* [Improve your agent](/ax/get-started/get-started-improve-your-agent): step-by-step get-started walkthrough that replays a failing production trace in the Playground and validates the fix across a dataset.
* [View and manage traces](/ax/observe/tracing/view-and-manage-traces): find the failing span that becomes your experimental variable.
* [Complete experiments notebook](https://github.com/Arize-ai/tutorials/blob/main/python/llm/experiments/datasets_experiments_quickstart_python.ipynb): end-to-end dataset, evaluator, and comparison walkthrough built around a support-agent example.
