In the previous guide, you improved your chatbot’s system prompt by adding grounding instructions. The fix looked good on the traces you tested — but prompt changes can have unintended side effects. Your new grounding rules might make the chatbot too conservative, refusing to answer questions it used to handle correctly. You need to test the new prompt against a representative set of queries — not just the few you happened to check — and measure the difference. Did groundedness scores improve? Did helpfulness stay the same? Are there regressions?

Experiments let you run the same set of inputs through different versions of your app, score the results, and compare them side by side. They turn “I think it’s better” into “I can prove it’s better.”
This is Part 4 of the Arize AX Get Started series. You should have completed the Prompts guide first, with an improved prompt saved in Prompt Hub.

Step 1: Create a dataset

A dataset is a collection of test cases you’ll run through your chatbot. Good datasets include a mix of:
  • Common questions your users actually ask
  • Edge cases where the answer depends on specific policy details
  • Known failures from production traces where your chatbot got it wrong
Download this sample CSV from the companion notebook, or create your own. Here’s what it looks like:
| input | expected_output |
| --- | --- |
| Can I get a refund on my non-refundable ticket? | No cash refund, but a travel credit is issued minus a $75 change fee. Credits expire in 12 months. |
| How much does a second checked bag cost? | $45 on all fare types. |
| I’m a Platinum member. Can I change my Basic fare for free? | Yes, Platinum members get free changes on all fares. |
| My flight was delayed 3 hours. What compensation do I get? | A $50 travel voucher for future SkyServe flights. |
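If you prefer to generate the CSV programmatically instead of downloading it, a minimal sketch using Python’s standard `csv` module (the rows mirror the sample table above; the filename matches the dataset name used in this guide):

```python
import csv

# The same test cases shown in the table above.
rows = [
    ("Can I get a refund on my non-refundable ticket?",
     "No cash refund, but a travel credit is issued minus a $75 change fee. Credits expire in 12 months."),
    ("How much does a second checked bag cost?",
     "$45 on all fare types."),
    ("I'm a Platinum member. Can I change my Basic fare for free?",
     "Yes, Platinum members get free changes on all fares."),
    ("My flight was delayed 3 hours. What compensation do I get?",
     "A $50 travel voucher for future SkyServe flights."),
]

with open("skyserve-test-cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "expected_output"])  # column names the upload expects
    writer.writerows(rows)
```

The resulting file uploads as-is in the steps below.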
To upload it:
  1. Navigate to Datasets in the left sidebar
  2. Click + New Dataset
  3. Upload your CSV file
  4. Give it a name like skyserve-test-cases
New Dataset dialog showing CSV upload with preview of test cases
You can also create datasets from your existing traces — selecting specific spans to include. This is a great way to build regression datasets from real production failures.
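The same idea works outside the UI. The sketch below is illustrative, not the AX export schema — it assumes you have exported trace records as dicts with `input`, `output`, and a groundedness label, and it keeps only the failures as new dataset rows:

```python
# Hypothetical exported trace records; the field names are assumptions for illustration.
traces = [
    {"input": "Can I pick my seat for free?", "output": "Yes, always!", "groundedness": "ungrounded"},
    {"input": "What is the carry-on size limit?", "output": "22 x 14 x 9 inches.", "groundedness": "grounded"},
    {"input": "Do you refund hotel bookings?", "output": "Of course.", "groundedness": "ungrounded"},
]

# Keep only the spans the evaluator flagged, so the dataset targets known failures.
# expected_output is left blank here for a human to fill in from the policy docs.
regressions = [{"input": t["input"], "expected_output": ""}
               for t in traces if t["groundedness"] == "ungrounded"]
print(len(regressions))  # 2 failing traces become 2 regression test rows
```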

Step 2: Run your old prompt as a baseline

Before testing the new prompt, you need a baseline — how did the old prompt perform on these test cases?
  1. Navigate to your dataset and click Open in Playground
  2. In the Playground, enter your original system prompt (the simple “be friendly and helpful” version)
  3. Click Run to execute the prompt against every row in the dataset
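Conceptually, each run sends every dataset row through the same system prompt as a chat request. A minimal sketch of that per-row payload construction (the helper and prompt text are illustrative, not the Playground’s internals):

```python
def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Assemble the chat payload that is effectively sent for each dataset row."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# The baseline uses the original, simple prompt from earlier in the series.
baseline_prompt = "You are a friendly and helpful airline support agent."
messages = build_messages(baseline_prompt, "How much does a second checked bag cost?")
```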
Playground with original prompt and skyserve-test-cases dataset loaded
Once the run completes, the experiment is automatically saved. Navigate to your dataset’s Experiments tab to see it.
Experiments tab showing baseline-original-prompt experiment

Step 3: Run your new prompt

Now do the same thing with your improved prompt:
  1. In the Playground, load your skyserve-support prompt from Prompt Hub (or paste in the improved system prompt with grounding instructions)
  2. Click Run to execute against the same dataset
  3. Save as a new experiment: improved-grounded-prompt
Playground with improved skyserve-support prompt and dataset ready to run
You now have two experiments on the same dataset — one for each prompt version.

Step 4: Add an evaluator

To compare the experiments objectively, add an evaluator that scores the results.
  1. Navigate to your dataset’s experiments view
  2. Click Add Evaluator
  3. Select your groundedness-check evaluator from the Eval Hub (the same one you created in the Evaluations guide)
  4. You can also add a Helpfulness evaluator — select it from the pre-built templates to measure whether the new prompt’s answers are still useful
Add Evaluator flow showing available evaluators from the hub
AX will run the evaluators against both experiments in the background.
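For intuition about what an evaluator produces, here is a toy, heuristic stand-in — the real groundedness-check evaluator is LLM-judged, but like this sketch it maps each response to a score that can be averaged per experiment:

```python
def toy_groundedness(response: str, policy: str) -> float:
    """Toy heuristic: fraction of response sentences that appear verbatim in the policy.
    A real evaluator uses an LLM judge, but produces a comparable per-response score."""
    sentences = [s.strip().lower() for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if s in policy.lower())
    return supported / len(sentences)

policy = "A second checked bag costs $45 on all fare types. Credits expire in 12 months."
score = toy_groundedness(
    "A second checked bag costs $45 on all fare types. Also, pets fly free", policy
)
print(score)  # 0.5 — one of two sentences is supported by the policy
```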

Step 5: Compare experiments

Once the evaluators finish, click Compare or enable Diff Mode to see the results side by side.
Compare Experiments view showing two experiments side by side
You should see:
  • Groundedness scores improved — fewer responses contain information not in the policy documents
  • Helpfulness scores stayed the same (or improved) — the grounding instructions didn’t make the chatbot unhelpfully cautious
  • Individual responses you can click into to see exactly what changed between the old and new prompt
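The core of that comparison can be approximated in a few lines: join the two experiments’ scores by test case and flag rows where the new prompt scored lower. The score values below are made up for illustration:

```python
# Illustrative per-test-case groundedness scores from the two experiments.
baseline = {"refund question": 0.4, "bag fee question": 1.0, "delay question": 0.6}
improved = {"refund question": 1.0, "bag fee question": 1.0, "delay question": 0.4}

# Positive delta = the new prompt improved; negative delta = a regression.
deltas = {case: improved[case] - baseline[case] for case in baseline}
regressions = [case for case, d in deltas.items() if d < 0]
print(regressions)  # ['delay question'] — worth loading back into the Playground
```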
Experiment comparison detail showing old vs new response for a single input
If you see a regression — a test case where the new prompt performs worse — you can click into it, load it in the Playground, and iterate further. This is the development loop: trace → evaluate → improve → experiment → repeat.

Congratulations!

You’ve completed the full development loop:
  1. Traced your app to see what’s happening inside it
  2. Evaluated responses automatically to measure quality
  3. Improved your prompt using real failure data in the Playground
  4. Proved the improvement works across a representative dataset with experiments
You now have a repeatable, data-driven process for improving your LLM application. No more guessing, no more hoping — you can measure quality and demonstrate improvement.

But your chatbot isn’t just a dev project. It’s serving real customers, and LLM apps can degrade in surprising ways — model provider changes, new types of questions, traffic spikes. You need to know the moment something goes wrong in production. Next up: we’ll set up dashboards and monitors so you have a real-time view of your app’s health — and get alerted the moment quality drops.

Next: Stay On Top of Production

Learn more about Experiments