You’ve created a trip planner prompt and saved it to Prompt Hub. Now it’s time to see how it performs across real examples. The Prompt Playground lets you load a dataset, run your prompt against every row, and attach evaluators to measure quality — all without writing any code. By the end of this tutorial, you’ll have an experiment showing how your trip planner prompt handles different destinations, durations, and travel styles.

Step 1: Create Your Dataset

Download Examples

Before testing in the Playground, you need a dataset in Arize AX. Navigate to Datasets and Experiments in the left sidebar. Upload this small trip planner dataset, which contains examples spanning destinations like Istanbul, Dubai, San Francisco, Bangkok, Reykjavik, Barcelona, Cape Town, and New York — each with different durations and travel styles.

Download Trip Planner Dataset

Your dataset includes the following columns that map to template variables in your prompt:
| Column | Maps to Variable |
| --- | --- |
| attributes.llm.prompt_template.variables.destination | {destination} |
| attributes.llm.prompt_template.variables.duration | {duration} |
| attributes.llm.prompt_template.variables.travel_style | {travel_style} |
| attributes.llm.prompt_template.variables.research | {research} |
| attributes.llm.prompt_template.variables.budget_info | {budget_info} |
| attributes.llm.prompt_template.variables.local_info | {local_info} |
The dataset also includes an output_content column with existing model outputs. You can use this as a reference when evaluating your prompt’s performance.
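If you ever need to build a dataset like this programmatically rather than downloading it, the schema above can be produced with a few lines of Python. This is a minimal sketch: the example row values are invented for illustration and are not taken from the downloadable dataset.

```python
import csv

# Column names follow the attributes.llm.prompt_template.variables.* schema
# shown in the table above, plus the optional output_content reference column.
COLUMNS = [
    "attributes.llm.prompt_template.variables.destination",
    "attributes.llm.prompt_template.variables.duration",
    "attributes.llm.prompt_template.variables.travel_style",
    "attributes.llm.prompt_template.variables.research",
    "attributes.llm.prompt_template.variables.budget_info",
    "attributes.llm.prompt_template.variables.local_info",
    "output_content",
]

# A single made-up example row; a real dataset would have one row per scenario.
row = {
    "attributes.llm.prompt_template.variables.destination": "Istanbul",
    "attributes.llm.prompt_template.variables.duration": "2 weeks",
    "attributes.llm.prompt_template.variables.travel_style": "standard",
    "attributes.llm.prompt_template.variables.research": "Hagia Sophia; Grand Bazaar",
    "attributes.llm.prompt_template.variables.budget_info": "Mid-range",
    "attributes.llm.prompt_template.variables.local_info": "Tipping ~10% is customary",
    "output_content": "",
}

with open("trip_planner_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow(row)
```

The resulting CSV can be uploaded through the same Datasets and Experiments flow described below.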

Create Dataset in Arize AX

In the Datasets and Experiments section, create a new dataset, name it (e.g., trip-planner-dataset), and upload the trip_planner_dataset.csv file you downloaded.

Step 2: Open the Playground and Load Your Prompt

  1. Navigate to Playgrounds from the left sidebar.
  2. Create a new Playground View and name it (e.g. trip-planner-test).
  3. Use the Load a prompt dropdown to load the first version of your trip-planner prompt.
  4. Your system message and user message template will populate automatically, including all the {variable} placeholders.

Step 3: Attach Your Dataset

  1. Use the Select a Dataset dropdown to choose trip-planner-dataset.
  2. The Playground will map your dataset columns to the template variables in your prompt.
Once the dataset is loaded, you’ll see the template variables highlighted in your prompt. Each row in the dataset represents a different trip scenario — from a 2-week standard trip to Istanbul, to a long weekend adventure in Reykjavik, to a family-friendly week in Istanbul. For more details on testing prompts on datasets, see the Test prompts on datasets guide.
Selecting a dataset in the Prompt Playground

Step 4: Add Evaluators

Before running the experiment, you can attach evaluators to measure the quality of the generated itineraries. In this tutorial, we’ll create a custom LLM-as-a-Judge evaluator that checks whether the trip planner’s output meets our specific quality criteria.

Create a Custom LLM-as-a-Judge Evaluator

  1. Add an evaluator: Click Add Evaluator and choose LLM-as-a-Judge, then select Create From Blank to define your own evaluation criteria.
  2. Name the evaluator: Give it a descriptive name like “Trip Plan Completeness”. This name appears in your experiment results, so choose something that clearly communicates what the evaluator measures.
  3. Pick a judge model: Select the LLM that will serve as the judge. Use a different model from the one generating itineraries (e.g., use gpt-5 as the judge if gpt-5-mini generated the output) to reduce self-preference bias.
  4. Define the evaluation template: The prompt template is the core of your evaluator. It describes the judge’s role, the criteria for each label, and the data it will see. Here’s a template tailored to our trip planner:
LLM-as-a-Judge Prompt Template
You are an expert evaluator judging whether a travel planner's
response is correct. The planner must create a trip plan with:
(1) essential info, (2) budget breakdown, and (3) local experiences.

CORRECT - The response:
- Accurately addresses the user's destination, duration, and travel style
- Includes essential travel info (weather, key attractions, etiquette)
- Mentions budget or cost details
- No major logical errors

INCORRECT - The response contains any of:
- Major errors about the destination, costs, or local info
- Missing essential info when a full trip plan was requested
- Wrong destination, duration, or travel style addressed

[BEGIN DATA]
************
[User Input]:
{input_user}

************
[Travel Plan]:
{output}
************
[END DATA]

Is the output correct or incorrect?
  5. Define labels: Set the output labels that constrain what the judge can return:
    • correct (score: 1) — the response meets all criteria
    • incorrect (score: 0) — the response fails one or more criteria
  6. Enable explanations: Toggle Explanations to On so the judge provides a brief rationale for each label. This helps you understand why specific itineraries passed or failed.
For a deeper dive into building custom evaluators, see the Custom LLM-as-a-Judge Tutorial.
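Under the hood, an LLM-as-a-Judge evaluator is just a templated prompt plus a label-to-score mapping. The sketch below illustrates the idea; `call_llm` is a hypothetical stand-in for any model client (a function from prompt string to response string), and the template is an abbreviated version of the one above.

```python
# Abbreviated judge template; {input_user} and {output} match the
# placeholders used in the full template above.
JUDGE_TEMPLATE = """You are an expert evaluator judging whether a travel \
planner's response is correct.

[User Input]:
{input_user}

[Travel Plan]:
{output}

Is the output correct or incorrect? Answer with one word."""

# Label-to-score mapping, mirroring the labels defined in the UI.
LABEL_SCORES = {"correct": 1, "incorrect": 0}

def judge(input_user: str, output: str, call_llm) -> dict:
    prompt = JUDGE_TEMPLATE.format(input_user=input_user, output=output)
    raw = call_llm(prompt).strip().lower()
    # Check "incorrect" first, since "correct" is a substring of "incorrect".
    label = "incorrect" if "incorrect" in raw else "correct"
    return {"label": label, "score": LABEL_SCORES[label]}
```

Constraining the judge to a closed label set like this is what makes the scores filterable and aggregable in the experiment results.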

Step 5: Run the Experiment

With your prompt, dataset, and evaluator in place, you’re ready to run a Playground Experiment. Click Run to execute the experiment. The Playground will:
  1. Iterate through each row in your dataset
  2. Fill in the template variables with the row’s values
  3. Send the completed prompt to your selected model
  4. Run each evaluator on the generated output
Click View Experiment to see the detailed results.
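Conceptually, the four steps the Playground performs map onto a simple loop. The sketch below is illustrative only; the function names are not part of the Arize SDK, and `generate` and each evaluator are plain callables.

```python
def run_experiment(rows, template, generate, evaluators):
    """Illustrative version of the Playground's experiment loop."""
    results = []
    for row in rows:                                    # 1. iterate dataset rows
        prompt = template.format(**row)                 # 2. fill template variables
        output = generate(prompt)                       # 3. send prompt to the model
        evals = {name: ev(row, output)                  # 4. run each evaluator
                 for name, ev in evaluators.items()}
        results.append({"prompt": prompt, "output": output, "evals": evals})
    return results
```

Each result row pairs the filled prompt, the model output, and every evaluator's verdict, which is exactly the shape you inspect in the next step.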

Step 6: Review Results

In the experiment results view, you can:
  • Compare outputs across examples: See how the prompt handles a 2-week Istanbul trip versus a long weekend in Reykjavik
  • Check evaluator scores: Filter by evaluation labels to find which examples passed or failed
  • Read judge explanations: Understand why specific outputs were scored the way they were
  • Identify patterns: Look for systematic issues — does the prompt struggle with longer trips? Does it miss budget constraints for certain travel styles?
Pay attention to edge cases in the dataset. The trip planner examples include a range of durations (weekend to 1 month), travel styles (solo, family, romantic, group), and destinations. If the prompt performs well on most but struggles with specific combinations, that’s a signal for optimization.
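If you export the experiment results, a few lines of Python can surface these failure patterns quickly. The result rows below are invented for illustration; in practice they would come from your exported experiment data.

```python
from collections import Counter

# Hypothetical exported results: one dict per example with its judge label.
results = [
    {"duration": "weekend", "travel_style": "solo",     "label": "correct"},
    {"duration": "1 month", "travel_style": "family",   "label": "incorrect"},
    {"duration": "1 month", "travel_style": "group",    "label": "incorrect"},
    {"duration": "1 week",  "travel_style": "romantic", "label": "correct"},
]

# Filter to failures, then count which durations fail most often.
failures = [r for r in results if r["label"] == "incorrect"]
by_duration = Counter(r["duration"] for r in failures)
print(by_duration.most_common())  # → [('1 month', 2)]
```

Grouping failures by duration or travel style this way makes systematic weaknesses, like a prompt that falls apart on month-long trips, easy to spot.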

Compare Multiple Prompt Versions

If you want to test a revised prompt side-by-side with the original, use the + button to add another prompt object. You can compare up to 3 prompts simultaneously. This is useful for quickly validating whether a change improves or regresses performance. See Test multiple prompts at once for more details.

Next Steps

You’ve now tested your trip planner prompt across a diverse dataset and have structured evaluation data showing:
  • Which examples the prompt handles well
  • Where it struggles or produces outputs that don’t meet quality criteria
  • Specific patterns in failures that can guide improvements
In the next step, you’ll use Prompt Optimization to automatically improve your prompt based on this evaluation data.

Optimize Your Prompt