> ## Documentation Index
> Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Improve Your Agent

> Use production failures to improve your prompt, then prove the fix works across your dataset

In the previous guide, the groundedness [evaluator](/ax/evaluate/create-evaluators) revealed a pattern: the chatbot makes claims not in the policy documents - the system prompt says "be helpful" but doesn't enforce grounding. Rather than guessing at a fix and redeploying, start from a real failure, fix it in Playground using the exact inputs that went wrong, then validate across a full dataset before shipping.

<Frame>
  <img src="https://mintcdn.com/arize-ax/uRr2KzrXrYRcZ2Xl/images/get-started/alyx-skyserve-low-groundedness-traces.png?fit=max&auto=format&n=uRr2KzrXrYRcZ2Xl&q=85&s=82b4c026b00dec7523e979f704f64780" alt="Arize AX skyserve-chatbot Traces view with Alyx assistant open, user request about groundedness-check failures this week, and Alyx task plan and progress" width="1024" height="586" data-path="images/get-started/alyx-skyserve-low-groundedness-traces.png" />
</Frame>

<Info>
  This is **Part 3** of the Arize AX Get Started series. You should have completed the [Evaluations guide](/ax/get-started/get-started-evaluations) first, with [evaluation](/ax/evaluate/run-evals-on-traces#viewing-results) scores visible on your traces.
</Info>

## Choose how you want to work

Use [Arize Skills](/ax/agents/arize-skills) to have your coding agent run improvement workflows from your editor, [Alyx](/ax/alyx) for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or **Code** to run programmatically.
In each path, you'll build a dataset from failing traces, iterate on your prompt, and compare experiments before shipping.

<Tabs>
  <Tab title="By Arize Skills">
    Use [Arize Skills](/ax/agents/arize-skills) with your coding agent to run the same workflow from your editor. The example prompts below are what you type to your agent — the skill loads automatically and handles the rest. Install the skills plugin and follow [Set up Arize with AI coding agents](/ax/set-up-with-ai-assistants) for authentication and CLI setup.

    ### Step 1: See evaluation results on your traces

    [`arize-trace`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-trace/SKILL.md)

    Spans include labels once an [eval task](/ax/evaluate/run-evals-on-traces#create-a-task) has run; see [Viewing results](/ax/evaluate/run-evals-on-traces#viewing-results) in the tracing UI.

    For example, you might say:

    > Export spans from skyserve-chatbot where groundedness-check failed this week

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/SCR-20260421-jwqy.png" alt="Terminal showing ax spans export command, export success message, summary of spans with low groundedness flagged as hallucinated, and a table of span and trace IDs with evaluator columns" />
    </Frame>

    ### Step 2: Create a dataset

    [`arize-dataset`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-dataset/SKILL.md)

    For example, you might say:

    > Create skyserve-test-cases from those failing traces

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/failingtrace.png" alt="Terminal: dataset skyserve-test-cases created from failing traces, with schema fields question, reference_text, original_output, trace and span IDs, and status counts" />
    </Frame>

    ### Step 3: Improve the system prompt

    [`arize-prompt-optimization`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-prompt-optimization/SKILL.md)

    For example, you might say:

    > Extract the system prompt from the failing skyserve-chatbot spans and generate an improved version. Use the groundedness-check eval labels and explanations as signal for what to fix.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/Screenshot%202026-04-23%20at%203.04.12%E2%80%AFPM.png" alt="Improved SkyServe system prompt with grounding rules and notes referencing groundedness-check eval labels and span-level failures" />
    </Frame>

    ### Step 4: Run both prompts as experiments

    [`arize-experiment`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-experiment/SKILL.md)

    Reuse the same [evaluators](/ax/evaluate/create-evaluators) you trust in production; see [Run evals on experiments](/ax/evaluate/run-evals-on-experiments).

    For example, you might say:

    > Run both prompt versions (original and the updated one) against the dataset and compare groundedness scores.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/Screenshot%202026-04-23%20at%203.46.14%E2%80%AFPM.png" alt="Experiment comparison table: skyserve-original-prompt at 4/5 groundedness versus skyserve-improved-prompt at 5/5 (100%)" />
    </Frame>
  </Tab>

  <Tab title="By Alyx">
    Use [Alyx](/ax/alyx) from [**Traces**](/ax/observe/tracing), the [**Prompt Playground**](/ax/prompts/prompt-playground), [**Datasets**](/ax/develop/datasets-and-experiments), or [**Experiments**](/ax/develop/datasets-and-experiments/compare-experiments) to run the same workflow in conversation. Then, follow the flow below.

    <Frame>
      <img src="https://mintcdn.com/arize-ax/uRr2KzrXrYRcZ2Xl/images/get-started/alyx-skyserve-low-groundedness-traces.png?fit=max&auto=format&n=uRr2KzrXrYRcZ2Xl&q=85&s=82b4c026b00dec7523e979f704f64780" alt="Arize AX skyserve-chatbot Traces view with Alyx assistant open, user request about groundedness-check failures this week, and Alyx task plan and progress" width="1024" height="586" data-path="images/get-started/alyx-skyserve-low-groundedness-traces.png" />
    </Frame>

    ### Step 1: See evaluation results on your traces

    Ask about traces that already have [evaluation](/ax/evaluate/run-evals-on-traces#viewing-results) columns from your [eval tasks](/ax/evaluate/run-evals-on-traces#create-a-task).

    For example, you might say:

    > Show me traces where groundedness-check failed this week and explain what went wrong

    ### Step 2: Create a dataset

    For example, you might say:

    > Create skyserve-test-cases from those failing traces

    ### Step 3: Improve the system prompt

    Grounding failures should line up with [evaluator](/ax/evaluate/create-evaluators) labels from [online evals on traces](/ax/evaluate/run-evals-on-traces).

    For example, you might say:

    > Extract the system prompt from that trace and suggest an improved version based on the groundedness failures

    ### Step 4: Run both prompts as experiments

    Compare runs with the same [evaluators](/ax/evaluate/create-evaluators) you use in production; see [Run evals on experiments](/ax/evaluate/run-evals-on-experiments).

    For example, you might say:

    > Run both prompt versions (original and the updated one) against the dataset and compare groundedness scores.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/Screenshot%202026-04-23%20at%203.36.07%E2%80%AFPM.png" alt="Playground compare view: versions A and B on skyserve-test-cases with groundedness scores and per-row grounded versus ungrounded labels" />
    </Frame>

    ### Step 5: Save to Prompt Hub

    When you are happy with the improved prompt, ask Alyx to save it to [Prompt Hub](/ax/prompts/prompt-hub) so you get a named template, version history, and rollbacks - the same outcome as clicking **Save to Prompt Hub** in the Playground UI.

    For example, you might say:

    > Save the improved system prompt from this Playground to Prompt Hub as skyserve-support. Use a version description like: added explicit grounding rules so the model refuses when the policy docs do not support an answer.
  </Tab>

  <Tab title="By UI">
    Follow these steps in the Arize AX UI: find failing traces, replay them in the [Prompt Playground](/ax/prompts/prompt-playground), tighten your prompt, build a dataset, run [experiments](/ax/develop/datasets-and-experiments/compare-experiments), compare [evals](/ax/evaluate/run-evals-on-experiments), and save to [Prompt Hub](/ax/prompts/prompt-hub).

    ### Step 1: See evaluation results on your traces

    Go to your **skyserve-chatbot** project and filter traces by the **groundedness-check** [evaluation](/ax/evaluate/run-evals-on-traces#viewing-results) score. Find a trace that failed — one where the chatbot made up information not in the policy documents — and click in to see what went wrong.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/12-traces-filtered-by-eval.png" alt="Traces filtered by groundedness evaluation showing hallucinated traces" />
    </Frame>

    ### Step 2: Replay in Prompt Playground

    Click **Open in Playground** on the span. AX automatically populates the system prompt, user message, and model settings that produced the bad answer — no manual setup needed.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/evaluate/llm%20span.png" alt="Trace detail for an LLM ChatCompletion span showing trace tree, span evaluations, Open in Playground, and Input Output tab with model and system prompt" />
    </Frame>

    <br />

    <Frame caption="Your LLM span prompt loaded into playground">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/13-playground-loaded-from-trace.png" alt="Prompt Playground auto-populated from a trace with system prompt, user message, and model" />
    </Frame>

    ### Step 3: Improve the system prompt

    The original is too loose:

    ```
    You are SkyServe Airlines' customer service assistant.
    Answer the customer's question based on the provided policy documents.
    Be friendly and helpful.
    ```

    Tighten it with explicit grounding rules — for example, require that every claim reference a specific policy, and instruct the model to say "I don't have that information" rather than guess. Click **Run** to confirm the response improves.

    <Frame caption="Update your prompt with explicit grounding rules">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/14-playground-before-after.png" alt="Playground with improved system prompt and new grounded response" />
    </Frame>

    ### Step 4: Create a dataset and run experiments

    A few spot-checks aren't enough. Create a dataset of representative test cases (common questions, edge cases, known failures) and run both prompt versions against it as experiments — one baseline, one improved. In **Datasets**, add examples (upload a CSV or build from traces) and open the dataset in Playground. Run your original prompt as the baseline experiment, then run your improved prompt on the same inputs.

    <Frame caption="Upload dataset:">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/18-dataset-upload.png" alt="New Dataset dialog showing CSV upload with preview of test cases" />
    </Frame>

    <br />

    <Frame caption="View your experiment">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/20-save-as-experiment.png" alt="Experiments tab showing baseline-original-prompt experiment" />
    </Frame>

    <br />

    <Frame caption="Select rows and run an experiment on your data">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/21-playground-run-new-prompt.png" alt="Playground with improved prompt and dataset ready to run" />
    </Frame>

    ### Step 5: Evaluate and compare

    Add your **groundedness-check** [evaluator](/ax/evaluate/create-evaluators) to both experiments (the same one you created in the [Evaluations guide](/ax/get-started/get-started-evaluations)) and use **Compare** mode to view results side by side. You can add a Helpfulness [evaluator](/ax/evaluate/create-evaluators) from the templates to check that answers stay useful. You should see groundedness improve while helpfulness stays flat. If you see a regression, iterate in Playground.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/22-experiment-add-evaluator.png" alt="Add Evaluator flow showing available evaluators from the hub" />
    </Frame>

    <br />

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/23-experiment_comparison.png" alt="Compare Experiments view showing two experiments side by side" />
    </Frame>

    ### Step 6: Save to Prompt Hub

    Once satisfied, click **Save to Prompt Hub**, name it `skyserve-support`, and add a version description. Your prompt is now versioned — your team can see the full history, compare versions, and roll back if needed.

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/16-save-to-prompt-hub.png" alt="Save to Prompt Hub dialog with name, description, and version description" />
    </Frame>

    <br />

    <Frame>
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/get-started-images/17-prompt-hub-version-history.png" alt="Prompt Hub showing skyserve-support version history and prompt template" />
    </Frame>
  </Tab>

  <Tab title="By Code">
    Run this workflow from the [Python SDK](/api-clients/python/overview), [TypeScript SDK](/api-clients/typescript/version-1/overview), or [`ax` CLI](/api-clients/cli/overview). Some features are in alpha or beta - please check individual reference pages for details.

    | Step                         | Python SDK                                                         | TypeScript SDK                                                         | CLI                                  |
    | ---------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------- | ------------------------------------ |
    | Filter spans by eval result  | [Link](/api-clients/python/version-8/client-resources/spans)       | [Link](/api-clients/typescript/version-1/client-resources/spans)       | [Link](/api-clients/cli/spans)       |
    | Create a dataset from traces | [Link](/api-clients/python/version-8/client-resources/datasets)    | [Link](/api-clients/typescript/version-1/client-resources/datasets)    | [Link](/api-clients/cli/datasets)    |
    | Manage prompts               | [Link](/api-clients/python/version-8/client-resources/prompts)     | [Link](/api-clients/typescript/version-1/client-resources/prompts)     | [Link](/api-clients/cli/prompts)     |
    | Run experiments              | [Link](/api-clients/python/version-8/client-resources/experiments) | [Link](/api-clients/typescript/version-1/client-resources/experiments) | [Link](/api-clients/cli/experiments) |
  </Tab>
</Tabs>

## Congratulations!

You've completed the full improvement loop:

1. Traced your app to see what's happening inside it.
2. Evaluated responses automatically to measure quality.
3. Improved your prompt using real failure data in the Playground.
4. Proved the improvement works across a representative dataset with experiments.

You now have a repeatable, data-driven process for improving your LLM application. No more guessing, no more hoping - you can measure quality and demonstrate improvement.

**Next up:** Deepen your tracing foundation so your improvement loop stays grounded in complete, high-quality telemetry.

<CardGroup cols={2}>
  <Card title="Next: Tracing concepts" icon="arrow-right" href="/ax/instrument/what-are-traces" />

  <Card title="Learn more about Experiments" icon="book-open" href="/ax/evaluate/run-evals-on-experiments" />
</CardGroup>
