What an experiment includes
Experiments typically combine four pieces. A dataset is the fixed benchmark you rerun against. A task is the prompt, model, pipeline, or agent behavior you want to test. Evaluators turn each output into a signal you can compare. A baseline gives you the unchanged version to measure against. That fixed setup is what makes experiments more reliable than one-off spot checks on traces or ad hoc examples.
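The four pieces can be sketched in plain Python. This is a conceptual illustration with hypothetical names, not the Arize SDK; the SDK wraps these same concepts with its own client and types.

```python
# Dataset: the fixed benchmark you rerun against.
dataset = [
    {"input": "Where is my order?", "expected": "order_status"},
    {"input": "Cancel my subscription", "expected": "cancellation"},
]

# Task: the behavior under test (here, a stubbed classifier).
def task(example):
    return "order_status" if "order" in example["input"] else "cancellation"

# Evaluator: turns each output into a signal you can compare.
def exact_match(output, example):
    return 1.0 if output == example["expected"] else 0.0

# Baseline vs. variant: run the same dataset through each and compare scores.
def run(task_fn):
    return sum(exact_match(task_fn(ex), ex) for ex in dataset) / len(dataset)

baseline_score = run(task)
```

Because the dataset and evaluator stay fixed, any score difference between the baseline run and a variant run is attributable to the task change alone.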
How to decide what to test
The strongest experiments start with a real failure, not a hunch. Before you write a variant, walk the trace back to the step that broke. If you do not have traces in Arize yet, start by setting up tracing so you can find real failures to turn into experiments.
- Open a bad outcome in Traces: a thumbs-down, a low eval score, an exception, or anything a reviewer flagged.
- Step through the trace span by span. Was the prompt ambiguous? Did the model ignore an instruction? Did the retriever return the wrong chunk? Did a tool call fail to parse?
- Identify the step that most directly caused the bad outcome. That step is your experimental variable, and it usually falls into one of these categories:
- Prompt change: wording, few-shot examples, output schema, or tool/function definitions.
- Model change: same prompt, different LLM.
- Invocation parameters: temperature, top-p, max tokens, and similar.
- Pipeline, agent, or custom logic: retrieval strategy, multi-step reasoning, tool-using loops, or any non-prompt code.
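Whichever category the variable falls into, change exactly one thing per experiment so score differences stay attributable. A minimal sketch of that discipline, using hypothetical config names:

```python
# Baseline config: the unchanged setup you run today (values are examples).
baseline = {
    "name": "baseline-original-prompt",
    "model": "gpt-4o",
    "temperature": 0.0,
    "prompt": "Answer the user's question concisely.",
}

def make_variant(name, **overrides):
    """Copy the baseline and override the single variable under test."""
    assert len(overrides) == 1, "change one variable at a time"
    return {**baseline, "name": name, **overrides}

# An invocation-parameter experiment: only temperature moves.
v1 = make_variant("variant-temp-0.7", temperature=0.7)
```

If a variant needs two overrides at once, that is usually two experiments: run them separately so you can tell which change moved the score.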
Plan a baseline
Your first run is the unchanged setup: the same prompt, model, and parameters you use today. Name it `baseline-original-prompt`, and name variants after the variable under test (`variant-concise-prompt`, `variant-gpt-4o`, `variant-rerank-top-5`) so the comparison view stays readable.
If your task predicts labels, pick the ground-truth column, the prediction column, and the positive-class label now. You’ll need them when comparing runs in the Playground.
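To see why those three choices matter, here is a sketch of the kind of comparison metrics they feed, written as plain Python rather than the Playground's own implementation:

```python
def classification_metrics(ground_truth, predictions, positive_label):
    """Precision and recall for the chosen positive class."""
    pairs = list(zip(ground_truth, predictions))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

# Example columns: "fraud" is the positive class under test.
metrics = classification_metrics(
    ground_truth=["fraud", "ok", "fraud", "ok"],
    predictions=["fraud", "fraud", "ok", "ok"],
    positive_label="fraud",
)
```

Note that swapping the positive-class label flips which mistakes count as false positives versus false negatives, which is why it is worth fixing before you compare runs.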
Choose your path
Once you’ve identified the failing step, ask where the change actually lives. Use the Playground when the change stays inside a single prompt call: prompt wording, model choice, invocation parameters, or tool-call behavior. You keep the rest of the system fixed and compare the variant directly against your baseline. Use code when the experiment spans more than one prompt or tool call, or when the run happens outside the Playground: harness swaps, multi-step pipelines, agent or subagent architectures, custom sandboxes, and remote runs you want to log from Python, TypeScript, the CLI, or another service. If you start in the Playground and realize the change needs any of that, move it to code.
Experiment in Playground
Single prompt call. Prompt, model, parameter, or tool-call changes. Update the variant, run it against your dataset, compare results.
Experiment in code
Complex or remote. Pipelines, agents, sandboxes, runtime credentials, or runs you want to log from code.
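A multi-step pipeline is the simplest case that forces an experiment into code. The sketch below uses hypothetical names (a stub retriever and a local harness, not the Arize SDK) to show the shape of such a run: the task spans retrieval plus generation, so no single Playground prompt call can represent it.

```python
def retrieve(question):
    # Stub retriever; a real pipeline would query your vector store here.
    docs = {"refund": "Refunds take 5-7 business days."}
    return [text for key, text in docs.items() if key in question.lower()]

def pipeline_task(example):
    # Multi-step: retrieval output feeds the answer step.
    context = retrieve(example["input"])
    return context[0] if context else "I don't know."

def run_experiment(name, dataset, task, evaluator):
    """Local harness: run the task over the dataset and score each output."""
    results = [
        {"input": ex["input"], "output": task(ex), "score": evaluator(task(ex), ex)}
        for ex in dataset
    ]
    return {"name": name, "results": results}

dataset = [
    {"input": "How long do refunds take?",
     "expected": "Refunds take 5-7 business days."},
]
exp = run_experiment(
    "variant-rerank-top-5", dataset, pipeline_task,
    lambda output, ex: float(output == ex["expected"]),
)
```

In a real setup the harness is replaced by the SDK's experiment runner, which logs each result back to Arize so code-based runs appear alongside Playground runs in the comparison view.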
Further reading
- Improve your agent: step-by-step get-started walkthrough that replays a failing production trace in the Playground and validates the fix across a dataset.
- View and manage traces: find the failing span that becomes your experimental variable.
- Complete experiments notebook: end-to-end dataset, evaluator, and comparison walkthrough for the support-agent example.