
Get your prompt
Get the prompt into the Playground from a saved version, a new draft, or a replayed production span. For the full prompt-improvement workflow, start with What are prompts?. Before your first run, confirm the LLM provider, model, and invocation parameters for the variant you want to test.
Use the Arize skills plugin with the arize-trace skill to pull existing prompts out of production traces, then the arize-prompt-optimization skill to revise one. Try asking your agent:
- “Extract my current support classifier prompt from the last day of traces.”
- “Pull the system prompt from my latest LLM spans and show me what it looks like.”
- “Revise this classifier prompt to be more concise and keep the category list.”
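Under the hood, pulling a prompt out of a trace means finding the system message on a recent LLM span. The sketch below shows the idea against a hypothetical span payload; the field names (`start_time`, `input_messages`) are illustrative, not the Arize trace schema.

```python
# Sketch: pull the system prompt out of captured LLM span payloads.
# The span structure here is a hypothetical example, not the Arize schema.
def extract_system_prompt(spans):
    """Return the system message from the most recent LLM span, if any."""
    for span in sorted(spans, key=lambda s: s["start_time"], reverse=True):
        for message in span.get("input_messages", []):
            if message.get("role") == "system":
                return message["content"]
    return None

spans = [
    {"start_time": 1, "input_messages": [
        {"role": "system", "content": "Classify the ticket into billing, login, or other."},
        {"role": "user", "content": "I can't sign in."},
    ]},
]
print(extract_system_prompt(spans))
```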

Attach your dataset
Attach the dataset so the Playground can run the prompt across every example.

The Playground Agent can attach a dataset and map columns for you:
- “Attach my billing-issues dataset and map the input column.”
- “Attach the regression-test dataset and use only the question and answer columns.”
- “Use the support-tickets dataset and map ticket_text to the input variable.”
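Column mapping amounts to binding dataset columns to the prompt template's variables. A minimal sketch, with illustrative column and variable names:

```python
# Sketch: map dataset columns to prompt template variables before a run.
# Column and variable names are illustrative, not tied to a real dataset.
def render_prompt(template, row, column_map):
    """Fill template variables using a {variable: column} mapping."""
    variables = {var: row[col] for var, col in column_map.items()}
    return template.format(**variables)

template = "Classify this support ticket: {input}"
row = {"ticket_text": "My invoice is wrong.", "priority": "high"}
rendered = render_prompt(template, row, {"input": "ticket_text"})
print(rendered)  # Classify this support ticket: My invoice is wrong.
```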

Add an evaluator
An evaluator scores each experiment output: a code-based check, an LLM-as-a-judge template, or a pre-built evaluator from Evaluators. Add one before you run so results are scored automatically.

The Playground Agent can attach evaluators for you:
- “Add my correctness evaluator to this experiment.”
- “Attach the helpfulness evaluator from Evaluators.”
- “Score this run with my groundedness evaluator.”
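A code-based evaluator is just a function from an output to a score. The sketch below checks that a classifier's output is one of an allowed label set; the labels are a hypothetical example.

```python
# Sketch of a code-based evaluator: checks that the model's output is one
# of the allowed categories. The label set is a hypothetical example.
ALLOWED = {"billing", "login", "other"}

def category_evaluator(output: str) -> dict:
    """Return a score of 1.0 when the output is a valid category label."""
    label = output.strip().lower()
    passed = label in ALLOWED
    return {"score": 1.0 if passed else 0.0, "label": "valid" if passed else "invalid"}

print(category_evaluator("Billing"))   # {'score': 1.0, 'label': 'valid'}
print(category_evaluator("refunds"))   # {'score': 0.0, 'label': 'invalid'}
```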

Run the experiment
Run the prompt against the dataset. The evaluators you attached score each output automatically.

The Playground Agent can start the run and summarize the first results for you:
- “Run the experiment on my dataset.”
- “Run this prompt against all 50 examples.”
- “Run this as a new experiment and summarize the results.”
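Conceptually, a run renders the prompt for each dataset row, calls the model, and scores the output with each attached evaluator. A minimal sketch with the model call stubbed out (a real run would hit your configured LLM provider):

```python
# Sketch of what a run does under the hood: render the prompt for each
# dataset row, call the model, and score the output with each evaluator.
def call_model(prompt: str) -> str:
    return "billing"  # stub: a real call goes to the configured model

def run_experiment(template, rows, evaluators):
    results = []
    for row in rows:
        output = call_model(template.format(**row))
        scores = {name: fn(output) for name, fn in evaluators.items()}
        results.append({"input": row, "output": output, "scores": scores})
    return results

rows = [{"input": "My invoice is wrong."}, {"input": "I can't sign in."}]
evaluators = {"non_empty": lambda out: 1.0 if out.strip() else 0.0}
results = run_experiment("Classify: {input}", rows, evaluators)
print(len(results), results[0]["scores"])  # 2 {'non_empty': 1.0}
```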

Compare experiments
Once you have at least two experiments on the same dataset, compare them to see what changed and whether the new run improved on the baseline.

Alyx can summarize and compare experiment results:
- “Summarize the differences between my last two experiments.”
- “Highlight the key regressions in experiment B vs A.”
- “What’s the average helpfulness score for each experiment?”
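The comparison itself reduces to two questions: what is the average score per run, and on which examples did the new run score worse? A sketch over two hypothetical per-example score lists:

```python
# Sketch: compare two scored experiments on the same dataset, computing the
# average score per run and flagging per-example regressions.
def average(scores):
    return sum(scores) / len(scores)

def regressions(baseline, candidate):
    """Indices where the candidate scored worse than the baseline."""
    return [i for i, (b, c) in enumerate(zip(baseline, candidate)) if c < b]

experiment_a = [1.0, 1.0, 0.0, 1.0]  # baseline scores per example
experiment_b = [1.0, 0.0, 1.0, 1.0]  # new run

print(average(experiment_a), average(experiment_b))  # 0.75 0.75
print(regressions(experiment_a, experiment_b))       # [1]
```

Note that identical averages can hide per-example regressions, which is why the index-level view matters.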

Additional Playground workflows
Once the main experiment loop is in place, the Playground also supports these workflows.

Compare prompts side-by-side
The Playground lets you compare up to three prompt variants at once. Use the + button on the right to add more variants, then run each one against the same dataset and compare outputs without leaving the Playground.
Production replay
Load any LLM span from your traces directly into the Playground. Arize fills in the prompt template, input messages, variables, parameters, and function definitions so you can reproduce the exact call without rebuilding it by hand.
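Replay works because a captured span carries everything the original call needed. The sketch below reassembles an invocation from a hypothetical span payload; the field names are illustrative, not the exact Arize schema.

```python
# Sketch: rebuild an invocation from a captured span payload so it can be
# replayed. The payload fields shown are illustrative, not the exact schema.
def rebuild_invocation(span):
    """Assemble the pieces needed to reproduce the original LLM call."""
    return {
        "messages": span["input_messages"],
        "model": span["model"],
        "parameters": span.get("invocation_parameters", {}),
        "tools": span.get("function_definitions", []),
    }

span = {
    "input_messages": [{"role": "user", "content": "Summarize this ticket."}],
    "model": "gpt-4o",
    "invocation_parameters": {"temperature": 0.2},
}
call = rebuild_invocation(span)
print(call["model"], call["parameters"])  # gpt-4o {'temperature': 0.2}
```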
If you need code
Experiment in code
Jump to the code path when your task outgrows the Playground: pipelines, agents, custom sandboxes, or anything that needs secrets and tracing.
Further reading
- View and manage traces: find the next failure mode to turn into dataset rows.
- Human review: reviewer labels on experiment rows that feed back into datasets.






