> ## Documentation Index > Fetch the complete documentation index at: https://arize-ax.mintlify.site/docs/llms.txt > Use this file to discover all available pages before exploring further. # Improve your agent > Use production failures to improve your prompt, then prove the fix works across your dataset In the previous guide, your [evaluator](/docs/ax/evaluate/create-evaluators) revealed a pattern. Take a common one: your app asserts things it cannot back up with a source. Your system prompt might say "be helpful," but it never tells the agent to stick to the information it has or to admit when it doesn't know. Whatever failure pattern your own evaluator surfaced, the workflow is the same: rather than guessing at a fix and redeploying, start from a real failure, fix it in Playground using the exact inputs that went wrong, then validate across a full dataset before shipping. Arize AX Traces view with Alyx assistant open, a request about evaluator failures this week, and Alyx task plan and progress

Arize AX Traces view with Alyx assistant open, a request about evaluator failures this week, and Alyx task plan and progress

This is **Part 3** of the Arize AX Get Started series. You should have completed the [Evaluations guide](/docs/ax/get-started/get-started-evaluations) first, with [evaluation](/docs/ax/evaluate/run-evals-on-traces#viewing-results) scores visible on your traces. ## Choose how you want to work Use [Arize Skills](/docs/ax/agents/arize-skills) to have your coding agent run improvement workflows from your editor, [Alyx](/docs/ax/alyx) for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or **Code** to run programmatically. In each path, you'll build a dataset from failing traces, iterate on your prompt, and compare experiments before shipping. Use [Arize Skills](/docs/ax/agents/arize-skills) with your coding agent to run the same workflow from your editor. The example prompts below are what you type to your agent. The skill loads automatically and handles the rest. Install the skills plugin and follow [Set up Arize with AI coding agents](/docs/ax/set-up-with-ai-assistants) for authentication and CLI setup. ### Step 1: See evaluation results on your traces [`arize-trace`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-trace/SKILL.md) Spans include labels once an [eval task](/docs/ax/evaluate/run-evals-on-traces#create-a-task) has run; see [Viewing results](/docs/ax/evaluate/run-evals-on-traces#viewing-results) in the tracing UI. For example, you might say: > Export spans from my project where my evaluator failed this week Terminal showing ax spans export command, export success message, summary of flagged spans, and a table of span and trace IDs with evaluator columns

Terminal showing ax spans export command, export success message, summary of flagged spans, and a table of span and trace IDs with evaluator columns

### Step 2: Create a dataset [`arize-dataset`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-dataset/SKILL.md) For example, you might say: > Create a test dataset from those failing traces Terminal: a test-cases dataset created from failing traces, with schema fields for input, reference text, original output, trace and span IDs, and status counts

Terminal: a test-cases dataset created from failing traces, with schema fields for input, reference text, original output, trace and span IDs, and status counts

### Step 3: Improve the system prompt [`arize-prompt-optimization`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-prompt-optimization/SKILL.md) For example, you might say: > Extract the system prompt from the failing spans and generate an improved version. Use my evaluator's labels and explanations as signal for what to fix. Improved system prompt with notes referencing evaluator labels and span-level failures

Improved system prompt with notes referencing evaluator labels and span-level failures

### Step 4: Run both prompts as experiments [`arize-experiment`](https://github.com/Arize-ai/arize-skills/blob/main/skills/arize-experiment/SKILL.md) Reuse the same [evaluators](/docs/ax/evaluate/create-evaluators) you trust in production; see [Run evals on experiments](/docs/ax/evaluate/run-evals-on-experiments). For example, you might say: > Run both prompt versions (original and the updated one) against the dataset and compare my evaluator's scores. Experiment comparison table: original prompt at 4/5 versus improved prompt at 5/5 (100%)

Experiment comparison table: original prompt at 4/5 versus improved prompt at 5/5 (100%)

Use [Alyx](/docs/ax/alyx) from [**Traces**](/docs/ax/observe/tracing/view-and-manage-traces), the [**Prompt Playground**](/docs/ax/prompts/prompt-playground), [**Datasets**](/docs/ax/develop/datasets-and-experiments), or [**Experiments**](/docs/ax/develop/datasets-and-experiments/compare-experiments) to run the same workflow in conversation. Then, follow the flow below. Arize AX Traces view with Alyx assistant open, a request about evaluator failures this week, and Alyx task plan and progress

### Step 1: See evaluation results on your traces Ask about traces that already have [evaluation](/docs/ax/evaluate/run-evals-on-traces#viewing-results) columns from your [eval tasks](/docs/ax/evaluate/run-evals-on-traces#create-a-task). For example, you might say: > Show me traces where my evaluator failed this week and explain what went wrong ### Step 2: Create a dataset For example, you might say: > Create a test dataset from those failing traces ### Step 3: Improve the system prompt The failures you fix should line up with the [evaluator](/docs/ax/evaluate/create-evaluators) labels from your [online evals on traces](/docs/ax/evaluate/run-evals-on-traces). For example, you might say: > Extract the system prompt from that trace and suggest an improved version based on the failures my evaluator flagged ### Step 4: Run both prompts as experiments Compare runs with the same [evaluators](/docs/ax/evaluate/create-evaluators) you use in production; see [Run evals on experiments](/docs/ax/evaluate/run-evals-on-experiments). For example, you might say: > Run both prompt versions (original and the updated one) against the dataset and compare my evaluator's scores. Playground compare view: versions A and B on a test-cases dataset with evaluator scores and per-row labels

Playground compare view: versions A and B on a test-cases dataset with evaluator scores and per-row labels

### Step 5: Save to Prompt Hub When you are happy with the improved prompt, ask Alyx to save it to [Prompt Hub](/docs/ax/prompts/prompt-hub) so you get a named template, version history, and rollbacks - the same outcome as clicking **Save to Prompt Hub** in the Playground UI. For example, you might say: > Save the improved system prompt from this Playground to Prompt Hub. Use a version description like: added explicit rules so the model only answers from the information it's given. Follow these steps in the Arize AX UI: find failing traces, replay them in the [Prompt Playground](/docs/ax/prompts/prompt-playground), tighten your prompt, build a dataset, run [experiments](/docs/ax/develop/datasets-and-experiments/compare-experiments), compare [evals](/docs/ax/evaluate/run-evals-on-experiments), and save to [Prompt Hub](/docs/ax/prompts/prompt-hub). ### Step 1: See evaluation results on your traces Go to your project and filter traces by your [evaluator](/docs/ax/evaluate/run-evals-on-traces#viewing-results)'s score. Find a trace that failed, for example one where your app made up information it was never given, and click in to see what went wrong. Traces filtered by an evaluation score showing flagged traces

Traces filtered by an evaluation score showing flagged traces

### Step 2: Replay in Prompt Playground Click **Open in Playground** on the span. AX automatically populates the system prompt, user message, and model settings that produced the bad answer, no manual setup needed. Trace detail for an LLM ChatCompletion span showing trace tree, span evaluations, Open in Playground, and Input Output tab with model and system prompt

Trace detail for an LLM ChatCompletion span showing trace tree, span evaluations, Open in Playground, and Input Output tab with model and system prompt

Prompt Playground auto-populated from a trace with system prompt, user message, and model

### Step 3: Improve the system prompt The original is too loose. Try something like: ``` You are a helpful assistant. Answer the user's question based on the provided context. Be friendly and helpful. ``` Tighten it with explicit grounding rules: for example, require that every claim be supported by the context the app was given, and instruct the model to say "I don't have that information" rather than guess. Click **Run** to confirm the response improves. Playground with improved system prompt and new response

Playground with improved system prompt and new response

### Step 4: Create a dataset and run experiments A few spot-checks aren't enough. Create a dataset of representative test cases (common questions, edge cases, known failures) and run both prompt versions against it as experiments: one baseline, one improved. In **Datasets**, add examples (upload a CSV or build from traces) and open the dataset in Playground. Run your original prompt as the baseline experiment, then run your improved prompt on the same inputs. New Dataset dialog showing CSV upload with preview of test cases

New Dataset dialog showing CSV upload with preview of test cases

Experiments tab showing baseline-original-prompt experiment

Playground with improved prompt and dataset ready to run

### Step 5: Evaluate and compare Add your [evaluator](/docs/ax/evaluate/create-evaluators) to both experiments (the same one you created in the [Evaluations guide](/docs/ax/get-started/get-started-evaluations)) and use **Compare** mode to view results side by side. It's worth adding a second evaluator from the templates (for example Helpfulness) to check that fixing one problem didn't create another. You should see the metric you targeted improve while the others stay flat. If you see a regression, iterate in Playground. Add Evaluator flow showing available evaluators from the hub

Add Evaluator flow showing available evaluators from the hub

Compare Experiments view showing two experiments side by side

### Step 6: Save to Prompt Hub Once satisfied, click **Save to Prompt Hub**, give it a name, and add a version description. Your prompt is now versioned. Your team can see the full history, compare versions, and roll back if needed. Save to Prompt Hub dialog with name, description, and version description

Save to Prompt Hub dialog with name, description, and version description

Prompt Hub showing a prompt's version history and template

Run this workflow from the [Python SDK](/docs/api-clients/python/overview), [TypeScript SDK](/docs/api-clients/typescript/version-1/overview), or [`ax` CLI](/docs/api-clients/cli/overview). Some features are in alpha or beta - please check individual reference pages for details. | Step | Python SDK | TypeScript SDK | CLI | | ---------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------- | ------------------------------------ | | Filter spans by eval result | [Link](/docs/api-clients/python/version-8/client-resources/spans) | [Link](/docs/api-clients/typescript/version-1/client-resources/spans) | [Link](/docs/api-clients/cli/spans) | | Create a dataset from traces | [Link](/docs/api-clients/python/version-8/client-resources/datasets) | [Link](/docs/api-clients/typescript/version-1/client-resources/datasets) | [Link](/docs/api-clients/cli/datasets) | | Manage prompts | [Link](/docs/api-clients/python/version-8/client-resources/prompts) | [Link](/docs/api-clients/typescript/version-1/client-resources/prompts) | [Link](/docs/api-clients/cli/prompts) | | Run experiments | [Link](/docs/api-clients/python/version-8/client-resources/experiments) | [Link](/docs/api-clients/typescript/version-1/client-resources/experiments) | [Link](/docs/api-clients/cli/experiments) | ## Congratulations! You've completed the full improvement loop: 1. Traced your app to see what's happening inside it. 2. Evaluated responses automatically to measure quality. 3. Improved your prompt using real failure data in the Playground. 4. Proved the improvement works across a representative dataset with experiments. You now have a repeatable, data-driven process for improving your LLM application. No more guessing, no more hoping - you can measure quality and demonstrate improvement. **Next up:** Deepen your tracing foundation so your improvement loop stays grounded in complete, high-quality telemetry.