This is Part 4 of the Arize AX Get Started series. You should have completed the Prompts guide first, with an improved prompt saved in Prompt Hub.
Step 1: Create a dataset
A dataset is a collection of test cases you’ll run through your chatbot. Good datasets include a mix of:
- Common questions your users actually ask
- Edge cases where the answer depends on specific policy details
- Known failures from production traces where your chatbot got it wrong
| input | expected_output |
|---|---|
| Can I get a refund on my non-refundable ticket? | No cash refund, but a travel credit is issued minus a $75 change fee. Credits expire in 12 months. |
| How much does a second checked bag cost? | $45 on all fare types. |
| I’m a Platinum member. Can I change my Basic fare for free? | Yes, Platinum members get free changes on all fares. |
| My flight was delayed 3 hours. What compensation do I get? | A $50 travel voucher for future SkyServe flights. |
- Navigate to Datasets in the left sidebar
- Click + New Dataset
- Upload your CSV file
- Give it a name like `skyserve-test-cases`
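If you'd rather script the file than type it by hand, the table above can be written out with Python's standard `csv` module. The column names `input` and `expected_output` match the table in this step; the filename is just the suggested dataset name.

```python
# Write the four test cases from the table above to a CSV file
# that can be uploaded on the Datasets page.
import csv

rows = [
    {
        "input": "Can I get a refund on my non-refundable ticket?",
        "expected_output": "No cash refund, but a travel credit is issued minus a $75 change fee. Credits expire in 12 months.",
    },
    {
        "input": "How much does a second checked bag cost?",
        "expected_output": "$45 on all fare types.",
    },
    {
        "input": "I'm a Platinum member. Can I change my Basic fare for free?",
        "expected_output": "Yes, Platinum members get free changes on all fares.",
    },
    {
        "input": "My flight was delayed 3 hours. What compensation do I get?",
        "expected_output": "A $50 travel voucher for future SkyServe flights.",
    },
]

with open("skyserve-test-cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)
```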

Step 2: Run your old prompt as a baseline
Before testing the new prompt, you need a baseline: how did the old prompt perform on these test cases?
- Navigate to your dataset and click Open in Playground
- In the Playground, enter your original system prompt (the simple “be friendly and helpful” version)
- Click Run to execute the prompt against every row in the dataset


Step 3: Run your new prompt
Now do the same thing with your improved prompt:
- In the Playground, load your `skyserve-support` prompt from Prompt Hub (or paste in the improved system prompt with grounding instructions)
- Click Run to execute against the same dataset
- Save as a new experiment: `improved-grounded-prompt`
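Conceptually, each Playground run does the same thing: send every dataset row through a system prompt and collect the outputs as one experiment. The sketch below illustrates that loop under stated assumptions; `call_llm` is a hypothetical stand-in for your real model client, and the prompt strings are placeholders, not the actual prompts from the Prompts guide.

```python
# Sketch of what a Playground run does for each experiment.
# call_llm is a hypothetical placeholder -- swap in your real
# chat-completion client before using this for anything.

def call_llm(system_prompt: str, user_message: str) -> str:
    # Placeholder response; a real client would call your model here.
    return f"[response to: {user_message}]"

def run_experiment(system_prompt: str, dataset: list[dict]) -> list[dict]:
    """Run every dataset row through one system prompt."""
    results = []
    for row in dataset:
        output = call_llm(system_prompt, row["input"])
        results.append({**row, "output": output})
    return results

dataset = [
    {"input": "How much does a second checked bag cost?",
     "expected_output": "$45 on all fare types."},
]

# Baseline experiment (old prompt) and improved experiment (new prompt)
# run over the SAME dataset, so the results are directly comparable.
baseline = run_experiment("You are a friendly, helpful airline assistant.", dataset)
improved = run_experiment("Answer only from the SkyServe policy documents provided.", dataset)
```

Running both prompts over the same rows is what makes the comparison in Step 5 meaningful.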

Step 4: Add an evaluator
To compare the experiments objectively, add an evaluator that scores the results.
- Navigate to your dataset’s experiments view
- Click Add Evaluator
- Select your `groundedness-check` evaluator from the Eval Hub (the same one you created in the Evaluations guide)
- You can also add a Helpfulness evaluator: select it from the pre-built templates to measure whether the new prompt’s answers are still useful
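To make the scoring shape concrete, here is a deliberately simplified, illustrative groundedness check: it flags a response as ungrounded if it asserts a dollar amount that never appears in the policy text. A real evaluator like the LLM-judge `groundedness-check` from the Evaluations guide is far more nuanced; the policy snippet below is also just an assumption for the example.

```python
# Toy groundedness evaluator: response text in, label out.
# Only checks dollar amounts against an assumed policy snippet;
# a production evaluator would use an LLM judge instead.
import re

POLICY_TEXT = """
Non-refundable tickets receive a travel credit minus a $75 change fee.
Credits expire in 12 months. A second checked bag costs $45 on all fare types.
Delays over 2 hours earn a $50 travel voucher.
"""

def groundedness_check(response: str, policy: str = POLICY_TEXT) -> str:
    # Every dollar figure the response asserts must appear in the policy.
    amounts = re.findall(r"\$\d+", response)
    grounded = all(a in policy for a in amounts)
    return "grounded" if grounded else "ungrounded"

print(groundedness_check("A second checked bag costs $45."))   # grounded
print(groundedness_check("We'll refund $200 to your card."))   # ungrounded
```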

Step 5: Compare experiments
Once the evaluators finish, click Compare or enable Diff Mode to see the results side by side. You should see:
- Groundedness scores improved — fewer responses contain information not in the policy documents
- Helpfulness scores stayed the same (or improved) — the grounding instructions didn’t make the chatbot unhelpfully cautious
- Individual response diffs: click into any row to see exactly what changed between the old and new prompts
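The comparison view boils down to joining the two experiments' evaluator scores row by row and looking at the aggregate change. A minimal sketch of that arithmetic, with made-up scores for illustration (not real results):

```python
# Per-row groundedness labels from the two experiments, encoded as
# 1 = grounded, 0 = ungrounded. These numbers are illustrative only.
baseline_scores = [0, 1, 0, 1]
improved_scores = [1, 1, 1, 1]

def mean(xs: list[int]) -> float:
    return sum(xs) / len(xs)

delta = mean(improved_scores) - mean(baseline_scores)
print(f"Groundedness: {mean(baseline_scores):.0%} -> {mean(improved_scores):.0%} ({delta:+.0%})")
# Groundedness: 50% -> 100% (+50%)
```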

Congratulations!
You’ve completed the full development loop:
- Traced your app to see what’s happening inside it
- Evaluated responses automatically to measure quality
- Improved your prompt using real failure data in the Playground
- Proved the improvement works across a representative dataset with experiments