Improve your agent

In the previous guide, your evaluator revealed a pattern. Take a common one: your app asserts things it cannot back up with a source. Your system prompt might say “be helpful,” but it never tells the agent to stick to the information it has or to admit when it doesn’t know. Whatever failure pattern your own evaluator surfaced, the workflow is the same: rather than guessing at a fix and redeploying, start from a real failure, fix it in Playground using the exact inputs that went wrong, then validate across a full dataset before shipping.

Arize AX Traces view with Alyx assistant open, a request about evaluator failures this week, and Alyx task plan and progress — Reviewing the traces an evaluator flagged, with Alyx open to help triage.

This is Part 3 of the Arize AX Get Started series. You should have completed the Evaluations guide first, with evaluation scores visible on your traces.

Choose how you want to work

Use Arize Skills to have your coding agent run improvement workflows from your editor, Alyx for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or Code to run programmatically. In each path, you’ll build a dataset from failing traces, iterate on your prompt, and compare experiments before shipping.

By Arize Skills
By Alyx
By UI
By Code

Use Arize Skills with your coding agent to run the same workflow from your editor. The example prompts below are what you type to your agent. The skill loads automatically and handles the rest. Install the skills plugin and follow Set up Arize with AI coding agents for authentication and CLI setup.

Step 1: See evaluation results on your traces

arize-traceSpans include labels once an eval task has run; see Viewing results in the tracing UI.For example, you might say:

Export spans from my project where my evaluator failed this week

Terminal showing ax spans export command, export success message, summary of flagged spans, and a table of span and trace IDs with evaluator columns — Exporting the spans an evaluator flagged as failures for triage.

Step 2: Create a dataset

arize-datasetFor example, you might say:

Create a test dataset from those failing traces

Terminal: a test-cases dataset created from failing traces, with schema fields for input, reference text, original output, trace and span IDs, and status counts — A test dataset created from the failing traces.

Step 3: Improve the system prompt

arize-prompt-optimizationFor example, you might say:

Extract the system prompt from the failing spans and generate an improved version. Use my evaluator’s labels and explanations as signal for what to fix.

Improved system prompt with notes referencing evaluator labels and span-level failures — An improved system prompt generated from the failing spans, using the evaluator's labels as signal.

Step 4: Run both prompts as experiments

arize-experimentReuse the same evaluators you trust in production; see Run evals on experiments.For example, you might say:

Run both prompt versions (original and the updated one) against the dataset and compare my evaluator’s scores.

Experiment comparison table: original prompt at 4/5 versus improved prompt at 5/5 (100%) — An experiment comparison showing the improved prompt scoring higher than the original on the evaluator.

Follow these steps in the Arize AX UI: find failing traces, replay them in the Prompt Playground, tighten your prompt, build a dataset, run experiments, compare evals, and save to Prompt Hub.

Step 1: See evaluation results on your traces

Go to your project and filter traces by your evaluator’s score. Find a trace that failed, for example one where your app made up information it was never given, and click in to see what went wrong.

Traces filtered by an evaluation score showing flagged traces — Traces filtered to the ones an evaluator flagged as failures.

Step 2: Replay in Prompt Playground

Click Open in Playground on the span. AX automatically populates the system prompt, user message, and model settings that produced the bad answer, no manual setup needed.

Trace detail for an LLM ChatCompletion span showing trace tree, span evaluations, Open in Playground, and Input Output tab with model and system prompt

Prompt Playground auto-populated from a trace with system prompt, user message, and model — Your LLM span prompt loaded into playground

Step 3: Improve the system prompt

The original is too loose. Try something like:

You are a helpful assistant.
Answer the user's question based on the provided context.
Be friendly and helpful.

Tighten it with explicit grounding rules: for example, require that every claim be supported by the context the app was given, and instruct the model to say “I don’t have that information” rather than guess. Click Run to confirm the response improves.

Playground with improved system prompt and new response — Tightening the prompt to fix the failure with explicit grounding rules.

Step 4: Create a dataset and run experiments

A few spot-checks aren’t enough. Create a dataset of representative test cases (common questions, edge cases, known failures) and run both prompt versions against it as experiments: one baseline, one improved. In Datasets, add examples (upload a CSV or build from traces) and open the dataset in Playground. Run your original prompt as the baseline experiment, then run your improved prompt on the same inputs.

New Dataset dialog showing CSV upload with preview of test cases — Upload dataset:

Experiments tab showing baseline-original-prompt experiment — View your experiment

Playground with improved prompt and dataset ready to run — Select rows and run an experiment on your data

Step 5: Evaluate and compare

Add your evaluator to both experiments (the same one you created in the Evaluations guide) and use Compare mode to view results side by side. It’s worth adding a second evaluator from the templates (for example Helpfulness) to check that fixing one problem didn’t create another. You should see the metric you targeted improve while the others stay flat. If you see a regression, iterate in Playground.

Add Evaluator flow showing available evaluators from the hub

Compare Experiments view showing two experiments side by side

Step 6: Save to Prompt Hub

Once satisfied, click Save to Prompt Hub, give it a name, and add a version description. Your prompt is now versioned. Your team can see the full history, compare versions, and roll back if needed.

Save to Prompt Hub dialog with name, description, and version description

Prompt Hub showing a prompt's version history and template

Run this workflow from the Python SDK, TypeScript SDK, or ax CLI. Some features are in alpha or beta - please check individual reference pages for details.

Step	Python SDK	TypeScript SDK	CLI
Filter spans by eval result	Link	Link	Link
Create a dataset from traces	Link	Link	Link
Manage prompts	Link	Link	Link
Run experiments	Link	Link	Link

Congratulations!

You’ve completed the full improvement loop:

Traced your app to see what’s happening inside it.
Evaluated responses automatically to measure quality.
Improved your prompt using real failure data in the Playground.
Proved the improvement works across a representative dataset with experiments.

You now have a repeatable, data-driven process for improving your LLM application. No more guessing, no more hoping - you can measure quality and demonstrate improvement. Next up: Deepen your tracing foundation so your improvement loop stays grounded in complete, high-quality telemetry.

Quickstart

Instrument

Observe

Evaluate

Improve

Agents

Machine Learning

Settings

Security

Choose how you want to work