Use experiments to turn a prompt, model, retrieval, pipeline, or agent change into a controlled comparison. Start from a dataset, define the task and baseline, and choose the evaluator signals you want to compare so every run answers the same question.

What an experiment includes

Experiments typically combine four pieces. A dataset is the fixed benchmark you rerun against. A task is the prompt, model, pipeline, or agent behavior you want to test. Evaluators turn each output into a signal you can compare. A baseline gives you the unchanged version to measure against. That fixed setup is what makes experiments more reliable than one-off spot checks on traces or ad hoc examples.
[Diagram] Experiment workflow: a dataset of examples feeds inputs into Run Tasks (app template, eval template, model change, retrieval strategy); task outputs, along with optional reference outputs from the dataset, are sent to an evaluator, which produces a score.
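As a concrete sketch of how the four pieces fit together, here is a minimal run, assuming the open-source arize-phoenix SDK and the OpenAI client; the dataset name, the question and answer keys, and the model are placeholders for your own setup.

```python
# A minimal sketch of the four pieces, assuming the arize-phoenix SDK
# (pip install arize-phoenix openai); names below are placeholders.
import phoenix as px
from phoenix.experiments import run_experiment
from openai import OpenAI

client = OpenAI()

# 1. Dataset: the fixed benchmark you rerun against.
dataset = px.Client().get_dataset(name="support-agent-failures")  # placeholder name

# 2. Task: the behavior under test; here, a single prompt call.
def task(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input["question"]}],  # placeholder key
    )
    return response.choices[0].message.content

# 3. Evaluator: turns each output into a signal you can compare.
def contains_expected_answer(output, expected):
    return expected["answer"].lower() in output.lower()  # placeholder key

# 4. Baseline: the unchanged setup, named so variants compare cleanly.
run_experiment(
    dataset,
    task,
    evaluators=[contains_expected_answer],
    experiment_name="baseline-original-prompt",
)
```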

How to decide what to test

The strongest experiments start with a real failure, not a hunch. Before you write a variant, walk the trace back to the step that broke. If you do not have traces in Arize yet, start by setting up tracing so you can find real failures to turn into experiments.
  1. Open a bad outcome in Traces — a thumbs-down, a low eval score, an exception, or anything a reviewer flagged.
  2. Step through the trace span by span. Was the prompt ambiguous? Did the model ignore an instruction? Did the retriever return the wrong chunk? Did a tool call fail to parse?
  3. Identify the step that most directly caused the bad outcome. That step is your experimental variable.
Hold everything else constant and test one thing at a time. If you change the prompt, model, and retriever all at once, you won’t know which change moved the score. Split those into separate variants, run each against the same dataset with the same evaluators, and compare (see the sketch after this list). Common things to test:
  • Prompt change: wording, few-shot examples, output schema, or tool/function definitions.
  • Model change: same prompt, different LLM.
  • Invocation parameters: temperature, top-p, max tokens, and similar.
  • Pipeline, agent, or custom logic: retrieval strategy, multi-step reasoning, tool-using loops, or any non-prompt code.
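For example, to isolate a model change, vary only the model between runs. This sketch continues the code above, reusing its client, dataset, and contains_expected_answer evaluator; the model names are illustrative.

```python
# One variable at a time: both runs share the dataset, prompt, and
# evaluator, and differ only in the model. Hold temperature and other
# invocation parameters constant across variants.
def make_task(model: str):
    def task(input):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input["question"]}],
            temperature=0.0,
        )
        return response.choices[0].message.content
    return task

for name, model in [
    ("baseline-original-prompt", "gpt-4o-mini"),  # the unchanged setup
    ("variant-gpt-4o", "gpt-4o"),                 # the one change
]:
    run_experiment(
        dataset,
        make_task(model),
        evaluators=[contains_expected_answer],
        experiment_name=name,
    )
```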
See Create evaluators if you need to refine what you’re measuring before running the comparison.

Plan a baseline

Your first run is the unchanged setup: same prompt, model, and parameters you use today. Name it baseline-original-prompt, and name variants after the variable under test (variant-concise-prompt, variant-gpt-4o, variant-rerank-top-5) so the comparison view stays readable. If your task predicts labels, pick the ground-truth column, the prediction column, and the positive-class label now. You’ll need them when comparing runs in the Playground.
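For a label-predicting task, the per-example signal can be a direct match between those two columns; a minimal sketch, assuming the ground truth lives under a label key (the key name is a placeholder, and the positive class is chosen later in the comparison view).

```python
# A minimal exact-match evaluator for a label-predicting task
# (a sketch; "label" is a placeholder for your ground-truth column).
def label_matches(output, expected):
    return output.strip().lower() == expected["label"].strip().lower()
```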

Choose your path

Once you’ve identified the failing step, ask where the change actually lives. Use the Playground when the change stays inside a single prompt call. That includes prompt wording, model choice, invocation parameters, and tool-call behavior. You keep the rest of the system fixed and compare the variant directly against your baseline. Use code when the experiment spans more than one prompt or tool call, or when the run happens outside the Playground. That covers harness swaps, multi-step pipelines, agent or subagent architecture, custom sandboxes, and remote runs you want to log from Python, TypeScript, the CLI, or another service. If you start in the Playground and realize the change needs any of that, move it to code.

Experiment in Playground

Single prompt call. Prompt, model, parameter, or tool-call changes. Update the variant, run it against your dataset, compare results.

Experiment in code

Complex or remote. Pipelines, agents, sandboxes, runtime credentials, or runs you want to log from code.
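When the change spans multiple steps, the task function simply wraps the whole pipeline. Here is a sketch, again assuming the arize-phoenix SDK; retrieve(), rerank(), the dataset name, and the question and answer keys are stand-ins for your own components.

```python
# A multi-step pipeline run as an experiment from code (a sketch;
# retrieve(), rerank(), and all names are placeholders for your own
# pipeline and schema).
import phoenix as px
from phoenix.experiments import run_experiment
from openai import OpenAI

client = OpenAI()

def retrieve(question: str) -> list[str]:
    # Stand-in for your vector-store query.
    return ["chunk about refunds", "chunk about shipping times"]

def rerank(chunks: list[str], question: str) -> list[str]:
    # Stand-in for your reranker: the variable under test.
    return chunks[:5]

def pipeline_task(input):
    question = input["question"]
    context = "\n".join(rerank(retrieve(question), question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

def answer_matches(output, expected):
    # Same signal as the baseline run, so the comparison is apples to apples.
    return expected["answer"].lower() in output.lower()

run_experiment(
    px.Client().get_dataset(name="support-agent-failures"),  # placeholder name
    pipeline_task,
    evaluators=[answer_matches],
    experiment_name="variant-rerank-top-5",
)
```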

Further reading

  • Improve your agent: step-by-step get-started walkthrough that replays a failing production trace in the Playground and validates the fix across a dataset.
  • View and manage traces: find the failing span that becomes your experimental variable.
  • Complete experiments notebook: end-to-end dataset, evaluator, and comparison walkthrough for the support-agent example.