You’ve already set up the experiment. The dataset is ready, the baseline is planned, and the change you want to test fits inside a prompt, a model swap, or an invocation parameter. In the Arize AX Playground, load a prompt, attach the dataset, add evaluators, run the experiment, and compare variants without writing code.
Playground section of the Arize AX UI showing a list of experiments run against a dataset

Get your prompt

Get the prompt into the Playground from a saved version, a new draft, or a replayed production span. For the full Improve Prompts workflow, start with What are prompts?. Before your first run, confirm the LLM provider, model, and invocation parameters for the variant you want to test.
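A prompt variant is just the template plus the provider settings it runs with. A minimal sketch of what one variant captures, with illustrative field names rather than the Playground's actual schema:

```python
# Illustrative only: one prompt variant as the Playground conceptually sees it.
# Field names are examples, not the Arize AX schema.
variant = {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "invocation_parameters": {"temperature": 0.2, "max_tokens": 512},
    "messages": [
        {"role": "system", "content": "You are a support ticket classifier."},
        # {input} is filled from the mapped dataset column at run time.
        {"role": "user", "content": "Classify this ticket: {input}"},
    ],
}
```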
Use the Arize skills plugin with the arize-trace skill to pull existing prompts out of production traces, then the arize-prompt-optimization skill to revise one. Try asking your agent:
  • “Extract my current support classifier prompt from the last day of traces.”
  • “Pull the system prompt from my latest LLM spans and show me what it looks like.”
  • “Revise this classifier prompt to be more concise and keep the category list.”
Coding agent running the arize-prompt-optimization skill via the ax CLI to extract a production prompt from traces and revise it without leaving the editor

Attach your dataset

Attach the dataset so the Playground can run the prompt across every example.
The Playground Agent can attach a dataset and map columns for you:
  • “Attach my billing-issues dataset and map the input column.”
  • “Attach the regression-test dataset and use only the question and answer columns.”
  • “Use the support-tickets dataset and map ticket_text to the input variable.”
Use the Playground Agent to attach a dataset and map columns
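Column mapping tells the Playground which dataset column fills each prompt variable. A hedged sketch, reusing the support-tickets example above; the row and mapping shapes are illustrative, not the Playground's internal format:

```python
# Illustrative dataset rows and mapping; not the Playground's internal format.
rows = [
    {"ticket_text": "I was charged twice for my subscription.", "expected_category": "billing"},
    {"ticket_text": "The app crashes when I open settings.", "expected_category": "bug"},
]

# The prompt variable {input} is filled from the ticket_text column for each row.
column_mapping = {"input": "ticket_text"}
```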

Add an evaluator

An evaluator scores each experiment output: a code-based check, an LLM-as-a-judge template, or a pre-built evaluator from Evaluators. Add one before you run so results are scored automatically.
The Playground Agent can attach evaluators for you:
  • “Add my correctness evaluator to this experiment.”
  • “Attach the helpfulness evaluator from Evaluators.”
  • “Score this run with my groundedness evaluator.”
Use the Playground Agent to attach an evaluator to your experiment
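A code-based evaluator boils down to a function that takes an experiment output (and usually the dataset row) and returns a score. A minimal sketch of an exact-match check; the function name and return shape are illustrative, not the Arize evaluator interface:

```python
# Minimal sketch of a code-based evaluator; the name and return shape are
# illustrative, not the Arize evaluator interface.
def exact_match_evaluator(output: str, expected: str) -> dict:
    """Score 1.0 when the model output matches the expected label exactly."""
    score = 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
    return {"score": score, "label": "correct" if score else "incorrect"}

# Scoring one experiment output against its dataset row's expected label.
print(exact_match_evaluator("Billing", "billing"))  # {'score': 1.0, 'label': 'correct'}
```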

Run the experiment

Run the prompt against the dataset. The evaluators you attached score each output automatically.
The Playground Agent can start the run and summarize the first results for you:
  • “Run the experiment on my dataset.”
  • “Run this prompt against all 50 examples.”
  • “Run this as a new experiment and summarize the results.”
Use the Playground Agent to run an experiment and summarize results

Compare experiments

Once you have at least two experiments on the same dataset, compare them to see what changed and whether the new run improved on the baseline.
Alyx can summarize and compare experiment results:
  • “Summarize the differences between my last two experiments.”
  • “Highlight the key regressions in experiment B vs A.”
  • “What’s the average helpfulness score for each experiment?”
Use Alyx to compare experiment results
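The comparison is conceptually a per-experiment aggregate over evaluator scores. A sketch with pandas, assuming a flat export with experiment, example_id, and helpfulness columns (the column names are illustrative):

```python
import pandas as pd

# Illustrative flat export of evaluator scores from two experiments.
results = pd.DataFrame([
    {"experiment": "A", "example_id": 1, "helpfulness": 0.6},
    {"experiment": "A", "example_id": 2, "helpfulness": 0.8},
    {"experiment": "B", "example_id": 1, "helpfulness": 0.9},
    {"experiment": "B", "example_id": 2, "helpfulness": 0.7},
])

# Average helpfulness per experiment.
print(results.groupby("experiment")["helpfulness"].mean())

# Per-example deltas to spot regressions between B and A.
pivot = results.pivot(index="example_id", columns="experiment", values="helpfulness")
print(pivot.assign(delta=pivot["B"] - pivot["A"]).sort_values("delta"))
```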

Additional Playground workflows

Once the main experiment loop is in place, the Playground also supports these workflows.

Compare prompts side-by-side

The Playground lets you compare up to three prompt variants at once. Use the + button on the right to add more variants, then run each one against the same dataset and compare outputs without leaving the Playground.
Prompt Playground showing three prompt variants side by side for comparison

Production replay

Load any LLM span from your traces directly into the Playground. Arize fills in the prompt template, input messages, variables, parameters, and function definitions so you can reproduce the exact call without rebuilding it by hand.
Load a production span into the Playground for replay and iteration

If you need code

Experiment in code

Jump to the code path when your task outgrows the Playground: pipelines, agents, custom sandboxes, or anything that needs secrets and tracing.
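As a rough sketch of that code path: a task function produces an output per dataset example and an evaluator scores it, and the Arize SDK's experiment runner wires the same pieces together. The snippet below uses the OpenAI client only as an example provider; the helper names run_task and score are illustrative, not an SDK interface:

```python
# Conceptual sketch only: a task generates an output per dataset example and an
# evaluator scores it. run_task and score are illustrative names, not an SDK API;
# OpenAI is used here purely as an example provider.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_task(row: dict) -> str:
    """Task: produce the model output for one dataset example."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify this ticket: {row['ticket_text']}"}],
    )
    return resp.choices[0].message.content

def score(output: str, row: dict) -> float:
    """Evaluator: 1.0 when the output matches the expected category."""
    return float(output.strip().lower() == row["expected_category"].lower())

rows = [{"ticket_text": "I was charged twice.", "expected_category": "billing"}]
results = []
for row in rows:
    output = run_task(row)
    results.append({"input": row["ticket_text"], "output": output, "score": score(output, row)})
print(results)
```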

Further reading