An experiment is a structured comparison between versions of your application using the same inputs and evaluation criteria. In this guide, you’ll pull down an existing dataset and run experiments in code to compare different versions and verify whether changes actually improve quality. At this point, you should already have a dataset from previous runs and at least one evaluation attached to those runs; experiments let you rerun that dataset through an updated version of your application and compare results side by side.

Before We Start

To follow along, you should already have:
  • Traces and Evals attached to a project in Phoenix
  • A dataset created from previous runs, such as failed traces

Follow along with code: This guide has a companion codebase with runnable code examples. Find it here.

Step 1: Use Explanations to Identify Improvements

Our dataset groups the application’s failed runs together; the next step is deciding which issues to fix. The explanations from the evals we ran previously, combined with the trace context, tell us why these runs failed. Looking at the traces in this dataset, you might notice patterns such as unclear instructions, missing constraints, or outputs that don’t follow the expected structure. The easiest way to spot these is to go back into the trace view for the failed runs and read the explanations for why each was labeled an “incomplete” answer. In this example, we’ll improve the agent by strengthening its instructions so the model has clearer guidance on what a good response looks like. First, let’s set up our imports:
import "dotenv/config";
import { getDataset } from "@arizeai/phoenix-client/datasets";
import type { Example } from "@arizeai/phoenix-client/types/datasets";
import { runExperiment } from "@arizeai/phoenix-client/experiments";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
import { Agent } from "@mastra/core/agent";
import { Mastra } from "@mastra/core/mastra";
import { financialSearchTool } from "../tools/financial-search-tool";
import { financialOrchestratorAgent } from "../agents/financial-orchestrator-agent";
import { financialWriterAgent } from "../agents/financial-writer-agent";
import { financialCompletenessTemplate } from "../evals/evals";
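If you’d rather review the failures in code than in the trace view, you can also pull the dataset down with getDataset (imported above) and print each example. This is a minimal sketch rather than a required step: it assumes getDataset accepts the same dataset-name selector that runExperiment uses later in this guide and returns a dataset with an examples array; check the client’s types if your version differs.
// Optional helper (assumed API shapes, see note above): print the stored
// input, output, and metadata for each failed example to spot common failure patterns.
async function inspectFailures() {
  const dataset = await getDataset({
    dataset: { datasetName: "ts quickstart fails" },
  });
  for (const example of dataset.examples) {
    console.log("input:", JSON.stringify(example.input));
    console.log("output:", JSON.stringify(example.output));
    console.log("metadata:", JSON.stringify(example.metadata));
    console.log("---");
  }
}
You could call await inspectFailures() from a one-off script before deciding what to change.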

Update the Agent Instructions

Below is an example of tightening the agent instructions to be more explicit about the expected output.
const financialResearcherAgent = new Agent({
  id: "financial-researcher-agent",
  name: "Financial Research Analyst",
  instructions: `You are a Senior Financial Research Analyst. Your job is to collect accurate, up-to-date financial information so a report writer can turn it into a polished analysis.

What to do:
- Use the financialSearch tool to look up each company or ticker mentioned in the request.
- For every ticker, pull: current or recent prices, key ratios (P/E, P/B, debt-to-equity, ROE), revenue and earnings, and notable news or events from the last 6 months.
- If the user asks for a specific focus (e.g. valuation, growth, dividends), prioritize that in your search and summary.
- For multiple tickers, run research per ticker and then summarize in one coherent research brief.

Output:
- Produce a single research summary that covers all requested tickers and focus areas.
- Be specific: use numbers and sources, not vague statements.
- Write so the Financial Report Writer can use this summary directly to draft the final report.

Make sure to report financial data for all tickers mentioned in the request. Use that financial data for the specific focus area mentioned in the request.`,
  model: "openai/gpt-4o",
  tools: { financialSearchTool },
});
Create an updated Mastra instance that registers the modified agent:
// The remaining code in this guide lives inside this main() function.
async function main() {
  const mastra = new Mastra({
    agents: {
      financialResearcherAgent,
      financialWriterAgent,
      financialOrchestratorAgent,
    },
  });
At this point, we’ve made a targeted change based on the explanations for why traces were classified as failures.

Step 2: Define an Experiment

Now that we’ve updated the agent, we’ll run the new agent flow to test whether the changes actually improve quality. Experiments in Phoenix let you rerun the same inputs through different versions of your application and compare the results side by side, so improvements are measured rather than assumed. To define an experiment, we need to specify:
  • The experiment task: a function that takes each example from a dataset and produces an output, typically by running your application logic or model on the input.
  • The experiment evaluation: essentially the same as a regular evaluation, but it assesses the quality of the task’s output, often by comparing it to an expected result or applying a scoring metric.
In this guide, the task is simply to rerun the agent with the updated instructions on each example. Since that produces new outputs for the same inputs, we apply the same completeness evaluation so the results are directly comparable.
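Conceptually, a task maps one dataset example to an output, and an evaluator turns that output into a score. The shapes below are simplified for illustration only and are not the library’s exported type names:
// Illustrative shapes (assumptions, not phoenix-client’s actual exports).
type ExperimentTask = (example: Example) => Promise<string>;
type EvaluationScore = { score: number; label?: string; explanation?: string };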

Define the Task

  // Task: run the updated agent on one dataset example and return its text output.
  const task = async (example: Example): Promise<string> => {
    // The example input may be the messages array itself or an object that nests
    // the messages under an `input` key, so normalize both shapes.
    const raw = example.input as unknown as
      | { role: "user"; content: string }[]
      | { input: { role: "user"; content: string }[] };
    const messages = Array.isArray(raw) ? raw : raw.input;
    const response = await mastra
      .getAgent("financialOrchestratorAgent")
      .generate(messages);
    return response.text ?? "";
  };

Define the Evaluator

  // LLM-as-judge evaluator: classifies each output as complete or incomplete and
  // maps those labels to numeric scores (complete = 1, incomplete = 0).
  const completenessEvaluator = createClassificationEvaluator({
    model: openai("gpt-4o-mini"),
    promptTemplate: financialCompletenessTemplate,
    choices: { complete: 1, incomplete: 0 },
    name: "completeness",
  });

Step 3: Run the Experiment on the Dataset

Next, we’ll specify the dataset we created earlier and run the experiment on it. This ensures we’re testing the new version of the agent on the exact same inputs that previously failed.
  const datasetSelector = { datasetName: "ts quickstart fails" };

  await runExperiment({
    dataset: datasetSelector,
    task,
    evaluators: [completenessEvaluator],
    experimentName: "new-experiment",
  });
}

// Run the script.
main().catch(console.error);
Once this completes, Phoenix logs the experiment results automatically.

Step 4: View Experiment Results in Phoenix

Head back to Phoenix and open the Experiments view. Here, you can see:
  • The original outputs from your dataset, shown as the reference output, alongside the new ones
  • The new application runs produced by the experiment task
  • Evaluation results for each version
In this example, we should see more traces receiving a complete label, indicating that the changes improved performance.
Congratulations! You’ve created your first dataset and run your first experiment in Phoenix.

Learn More About Datasets and Experiments

This was a simple example, but datasets and experiments support much more advanced workflows. If you want to test prompt changes to a specific part of your application and keep track of different prompt versions, the Prompt Playground guide walks through how to do that. To go deeper with datasets and experiments, you can build datasets for specific user segments or edge cases, compare multiple prompt or model variants, and track quality improvements over time as your application evolves. The Datasets and Experiments section covers these patterns in more detail.