You’ve already set up the experiment. The dataset is ready, the baseline is planned, and the change you want to test fits inside a prompt, a model swap, or an invocation parameter. In the Arize AX Playground, load a prompt, attach the dataset, add evaluators, run the experiment, and compare variants without writing code.
Playground section of the Arize AX UI showing a list of experiments run against a dataset

Get your prompt

Get the prompt into the Playground from a saved version, a new draft, or a replayed production span. For the full Improve Prompts workflow, start with What are prompts?. Before your first run, confirm the LLM provider, model, and invocation parameters for the variant you want to test.
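A prompt variant is just the template plus the provider settings it runs with. A minimal sketch of what one variant captures, with illustrative field names rather than the Playground's actual schema:

```python
# Illustrative only: one prompt variant as the Playground conceptually sees it.
# Field names are examples, not the Arize AX schema.
variant = {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "invocation_parameters": {"temperature": 0.2, "max_tokens": 512},
    "messages": [
        {"role": "system", "content": "You are a support ticket classifier."},
        # {input} is filled from the mapped dataset column at run time.
        {"role": "user", "content": "Classify this ticket: {input}"},
    ],
}
```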
Use the Arize skills plugin with the arize-trace skill to pull existing prompts out of production traces, then the arize-prompt-optimization skill to revise one. Try asking your agent:
  • “Extract my current support classifier prompt from the last day of traces.”
  • “Pull the system prompt from my latest LLM spans and show me what it looks like.”
  • “Revise this classifier prompt to be more concise and keep the category list.”
Coding agent running the arize-prompt-optimization skill via the ax CLI to extract a production prompt from traces and revise it without leaving the editor

Attach your dataset

Attach the dataset so the Playground can run the prompt across every example.
The Playground Agent can attach a dataset and map columns for you:
  • “Attach my billing-issues dataset and map the input column.”
  • “Attach the regression-test dataset and use only the question and answer columns.”
  • “Use the support-tickets dataset and map ticket_text to the input variable.”
Use the Playground Agent to attach a dataset and map columns
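Column mapping tells the Playground which dataset column fills each prompt variable. A hedged sketch, reusing the support-tickets example above; the row and mapping shapes are illustrative, not the Playground's internal format:

```python
# Illustrative dataset rows and mapping; not the Playground's internal format.
rows = [
    {"ticket_text": "I was charged twice for my subscription.", "expected_category": "billing"},
    {"ticket_text": "The app crashes when I open settings.", "expected_category": "bug"},
]

# The prompt variable {input} is filled from the ticket_text column for each row.
column_mapping = {"input": "ticket_text"}
```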

Add an evaluator

An evaluator scores each experiment output: a code-based check, an LLM-as-a-judge template, or a pre-built evaluator from Evaluators. Add one before you run so results are scored automatically.
The Playground Agent can attach evaluators for you:
  • “Add my correctness evaluator to this experiment.”
  • “Attach the helpfulness evaluator from Evaluators.”
  • “Score this run with my groundedness evaluator.”
Use the Playground Agent to attach an evaluator to your experiment
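A code-based evaluator boils down to a function that takes an experiment output (and usually the dataset row) and returns a score. A minimal sketch of an exact-match check; the function name and return shape are illustrative, not the Arize evaluator interface:

```python
# Minimal sketch of a code-based evaluator; the name and return shape are
# illustrative, not the Arize evaluator interface.
def exact_match_evaluator(output: str, expected: str) -> dict:
    """Score 1.0 when the model output matches the expected label exactly."""
    score = 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
    return {"score": score, "label": "correct" if score else "incorrect"}

# Scoring one experiment output against its dataset row's expected label.
print(exact_match_evaluator("Billing", "billing"))  # {'score': 1.0, 'label': 'correct'}
```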

Run the experiment

Run the prompt against the dataset. The evaluators you attached score each output automatically.
The Playground Agent can start the run and summarize the first results for you:
  • “Run the experiment on my dataset.”
  • “Run this prompt against all 50 examples.”
  • “Run this as a new experiment and summarize the results.”
Use the Playground Agent to run an experiment and summarize results

Compare experiments

Once you have at least two experiments on the same dataset, compare them to see what changed and whether the new run improved on the baseline.
Alyx can summarize and compare experiment results:
  • “Summarize the differences between my last two experiments.”
  • “Highlight the key regressions in experiment B vs A.”
  • “What’s the average helpfulness score for each experiment?”
Use Alyx to compare experiment results
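The comparison is conceptually a per-experiment aggregate over evaluator scores. A sketch with pandas, assuming a flat export with experiment, example_id, and helpfulness columns (the column names are illustrative):

```python
import pandas as pd

# Illustrative flat export of evaluator scores from two experiments.
results = pd.DataFrame([
    {"experiment": "A", "example_id": 1, "helpfulness": 0.6},
    {"experiment": "A", "example_id": 2, "helpfulness": 0.8},
    {"experiment": "B", "example_id": 1, "helpfulness": 0.9},
    {"experiment": "B", "example_id": 2, "helpfulness": 0.7},
])

# Average helpfulness per experiment.
print(results.groupby("experiment")["helpfulness"].mean())

# Per-example deltas to spot regressions between B and A.
pivot = results.pivot(index="example_id", columns="experiment", values="helpfulness")
print(pivot.assign(delta=pivot["B"] - pivot["A"]).sort_values("delta"))
```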

Additional Playground workflows

Once the main experiment loop is in place, the Playground also supports these workflows.

Compare prompts side-by-side

The Playground lets you compare up to three prompt variants at once. Use the + button on the right to add more variants, then run each one against the same dataset and compare outputs without leaving the Playground.
Prompt Playground showing three prompt variants side by side for comparison

Production replay

Load any LLM span from your traces directly into the Playground. Arize fills in the prompt template, input messages, variables, parameters, and function definitions so you can reproduce the exact call without rebuilding it by hand.
Load a production span into the Playground for replay and iteration

If you need code

Experiment in code

Jump to the code path when your task outgrows the Playground: pipelines, agents, custom sandboxes, or anything that needs secrets and tracing.
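As a rough sketch of that code path: a task function produces an output per dataset example and an evaluator scores it, and the Arize SDK's experiment runner wires the same pieces together. The snippet below uses the OpenAI client only as an example provider; the helper names run_task and score are illustrative, not an SDK interface:

```python
# Conceptual sketch only: a task generates an output per dataset example and an
# evaluator scores it. run_task and score are illustrative names, not an SDK API;
# OpenAI is used here purely as an example provider.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_task(row: dict) -> str:
    """Task: produce the model output for one dataset example."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify this ticket: {row['ticket_text']}"}],
    )
    return resp.choices[0].message.content

def score(output: str, row: dict) -> float:
    """Evaluator: 1.0 when the output matches the expected category."""
    return float(output.strip().lower() == row["expected_category"].lower())

rows = [{"ticket_text": "I was charged twice.", "expected_category": "billing"}]
results = []
for row in rows:
    output = run_task(row)
    results.append({"input": row["ticket_text"], "output": output, "score": score(output, row)})
print(results)
```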

Further reading