An evaluation produces a score or label for an output, so you can track quality across runs. Traces tell us what happened during a run, but not whether the output was good; evaluations fill that gap by scoring outputs in a consistent, repeatable way, so correctness or relevance can be reasoned about systematically instead of judged case by case.

In this guide, we’ll set up evaluations in Phoenix to measure the quality of model outputs from a real application. We’ll start with data that already exists in Phoenix, define a simple evaluation, and run it so we can see the results directly in the UI. The goal is to move from “I have model outputs” to “I can measure quality in a repeatable way”: since we already have traces, we can score them against metrics like correctness, relevance, or custom checks that matter to your use case.

Before We Start

To follow along, you’ll need to have completed Get Started with Tracing, which means you should already have:
  • Financial Analysis and Research Chatbot
  • Trace Data in Phoenix

Follow along with code: This guide has a companion codebase with runnable code examples. Find it here.

Step 1: Make Sure You Have Data in Phoenix

Before we can run evaluations, we need something to evaluate. Evaluations in Phoenix run over existing trace data. If you followed the tracing guide, you should already have:
  • A project in Phoenix
  • Traces containing LLM inputs and outputs
It’s best to have multiple traces so we can see how evaluation results vary from run to run. If needed, run your agent a few times with different inputs to generate more data.

Create a new folder in src/mastra called evals to hold the scripts we’ll write during this guide. The first script runs additional queries to generate more trace data in our Phoenix project. Before running it, make sure the Mastra dev server is running in the background (npm run dev). Then create a file called add_traces.ts:
import "dotenv/config";
import { MastraClient } from "@mastra/client-js";

// Connect to the local Mastra dev server started with `npm run dev`.
const mastraClient = new MastraClient({
  baseUrl: "http://localhost:4111",
});

const agent = mastraClient.getAgent("financialOrchestratorAgent");

const questions = [
  "Research NVDA with focus on valuation metrics and growth prospects",
  "Research AAPL, MSFT with focus on comparative financial analysis",
  "Research META, SNAP, PINS with focus on social media sector trends",
  "Research RIVN with focus on financial health and viability",
  "Research KO with focus on dividend yield and stability",
  "Research META with focus on latest developments and stock performance",
  "Research AAPL, MSFT, GOOGL, AMZN, META with focus on big tech comparison and market outlook",
  "Research Apple with focus on financial analysis and market outlook",
];

// Run the questions sequentially, waiting for each agent call to complete before sending the next.
for (const question of questions) {
  await agent.generate([{ role: "user", content: question }]);
  console.log(`Completed: ${question}`);
}
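With the Mastra dev server still running, run the script to generate the extra traces. How you run it depends on your setup; if you execute TypeScript files with tsx, for example, npx tsx src/mastra/evals/add_traces.ts should work. Each question will appear as a new trace in your Phoenix project.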

Step 2: Define an Evaluation

Now that we have trace data, the next question is how we decide whether an output is actually good. An evaluation makes that decision explicit: instead of manually inspecting outputs or relying on intuition, we define a rule that Phoenix can apply consistently across many runs.

Evaluations in Phoenix can be written in different ways. In this guide, we’ll use an LLM-as-a-judge evaluation as a simple starting point; it works well for questions like correctness or relevance and gets us metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.) An LLM-as-a-judge evaluation is defined by three things:
  • A prompt that describes the judgment criteria
  • An LLM that performs the evaluation
  • The data we want to score
In this step, we’ll define a basic completeness evaluation that checks whether the agent’s output fully answers the input. Phoenix also provides pre-built evaluation templates you can use or adapt for other metrics like relevance or hallucinations.

First, create a file called evals.ts in src/mastra/evals to hold our evaluation code, and start by adding the imports and constants at the top of the file. We’ll use phoenix-evals to create the evaluator and phoenix-client to fetch traces in code and push annotations back to the project.
import "dotenv/config";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
import { getSpans, logSpanAnnotations } from "@arizeai/phoenix-client/spans";

const EVAL_NAME = "completeness";
// The orchestrator agent's span name as it appears in Phoenix.
const AGENT_SPAN_NAME = "invoke_agent Financial Analysis Orchestrator";
const PROJECT_NAME = process.env.PHOENIX_PROJECT_NAME ?? "mastra-tracing-quickstart";

Define the Evaluation Prompt

We’ll start by defining the prompt that tells the evaluator how to judge an answer.
const financialCompletenessTemplate = `
You are evaluating whether a financial research report correctly completes ALL parts of the user's task.

User input: {{input}}

Generated report:
{{output}}

To be marked as "complete", the report should:
1. Cover ALL companies/tickers mentioned in the input (if multiple are listed, all must be addressed)
2. Address ALL focus areas mentioned in the input (e.g., if user asks for "earnings and outlook", both must be covered)
3. Provide relevant financial information for each company/ticker requested

The report is "incomplete" if:
- It misses any company/ticker mentioned in the input
- It fails to address any focus area mentioned in the input
- It only partially covers the requested companies or topics

Examples:
- Input: "tickers: AAPL, MSFT, focus: earnings and outlook" → Report must cover BOTH AAPL AND MSFT, AND address BOTH earnings AND outlook
- Input: "tickers: TSLA, focus: valuation metrics" → Report must cover TSLA AND address valuation metrics
- Input: "tickers: NVDA, AMD, focus: comparative analysis" → Report must cover BOTH NVDA AND AMD AND provide comparison

Respond with ONLY one word: "complete" or "incomplete"
Then provide a brief explanation of which parts were completed or missed.
`;
This prompt defines what completeness means for our application. The {{input}} and {{output}} placeholders will be filled with the values we pass to the evaluator for each span.

Create the Evaluator

Now we can combine the prompt and model into an evaluator. We’ll wrap our evaluation logic in a main() function to handle async operations.
async function main() {
  const evaluator = createClassificationEvaluator({
    // The cast bridges the ai-sdk model type to the model type phoenix-evals expects.
    model: openai("gpt-4o-mini") as Parameters<typeof createClassificationEvaluator>[0]["model"],
    promptTemplate: financialCompletenessTemplate,
    // Map each label the judge can return to a numeric score.
    choices: { complete: 1, incomplete: 0 },
    name: EVAL_NAME,
  });
At this point, we’ve defined how Phoenix should evaluate completeness, but we haven’t run it yet.
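If you want to sanity-check the judge before pointing it at real trace data, you can call it on a hand-written example inside main(). The input and output below are made up for illustration; the call returns a label, a numeric score, and an explanation:
  // Optional sanity check on made-up data before using real traces.
  const sample = await evaluator.evaluate({
    input: "Research TSLA with focus on valuation metrics",
    output: "TSLA report: valuation metrics covered (P/E, P/S, growth outlook)...",
  });
  console.log(sample.label, sample.score, sample.explanation);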

Step 3: Fetch and Filter Trace Data

Before we run our evaluator, we’ll need to pull down our trace data and prepare it to pass into the evaluator. We’ll get all the spans from Phoenix, filter for just the orchestrator agent spans, and extract their input and output values.
  // Fetch recent spans from the Phoenix project.
  const { spans } = await getSpans({
    project: { projectName: PROJECT_NAME },
    limit: 500,
  });

  const toEvaluate: { spanId: string; input: string; output: string }[] = [];
  for (const s of spans) {
    // The exact span shape can vary, so we check a few candidate fields for the name and id below.
    const span = s as {
      name?: string;
      span_name?: string;
      attributes?: Record<string, unknown>;
      context?: { span_id?: string };
      span_id?: string;
      id?: string;
    };
    // Keep only the orchestrator agent spans.
    if ((span.name ?? span.span_name) !== AGENT_SPAN_NAME) continue;
    const attrs = span.attributes ?? {};
    // Input/output may live under different attribute keys and may not be strings.
    const rawInput = attrs["input.value"] ?? attrs["input"];
    const rawOutput = attrs["output.value"] ?? attrs["output"];
    const input =
      typeof rawInput === "string"
        ? rawInput
        : rawInput != null
          ? JSON.stringify(rawInput)
          : null;
    const output =
      typeof rawOutput === "string"
        ? rawOutput
        : rawOutput != null
          ? JSON.stringify(rawOutput)
          : null;
    const rawId = span.context?.span_id ?? span.span_id ?? span.id;
    const spanId = rawId != null ? String(rawId) : null;
    if (input && output && spanId) toEvaluate.push({ spanId, input, output });
  }

  console.log(`Found ${toEvaluate.length} orchestrator spans to evaluate`);
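If the count comes back as zero, the orchestrator span is probably named differently in your project. As a quick debugging sketch (not part of the final script), you can print the distinct span names Phoenix returned and compare them against AGENT_SPAN_NAME:
  // Debugging aid: list the distinct span names in the fetched data.
  const names = spans.map(
    (s) => (s as { name?: string; span_name?: string }).name ?? (s as { name?: string; span_name?: string }).span_name,
  );
  console.log([...new Set(names)]);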

Step 4: Run the Evaluator

Now that we have both the evaluator and the data, the next step is to run the evaluator over each span and collect the results as Phoenix span annotations.
  // Evaluate every span concurrently and shape each result as a Phoenix span annotation.
  const spanAnnotations = await Promise.all(
    toEvaluate.map(async ({ spanId, input, output }) => {
      const { label, score, explanation } = await evaluator.evaluate({
        input,
        output,
      });
      return {
        spanId,
        name: EVAL_NAME as "completeness",
        label,
        score,
        explanation,
        annotatorKind: "LLM" as const,
        metadata: { evaluator: EVAL_NAME, input, output },
      };
    }),
  );
This produces one evaluation result for each orchestrator span we collected.
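Promise.all sends all of the judge calls at once. If that runs into rate limits with your model provider, a sequential loop over the same data works as well; this sketch builds the same annotation objects one at a time (if you use it, pass sequentialAnnotations to logSpanAnnotations in the next step):
  // Alternative: evaluate spans one at a time to stay under provider rate limits.
  const sequentialAnnotations: typeof spanAnnotations = [];
  for (const { spanId, input, output } of toEvaluate) {
    const { label, score, explanation } = await evaluator.evaluate({ input, output });
    sequentialAnnotations.push({
      spanId,
      name: EVAL_NAME,
      label,
      score,
      explanation,
      annotatorKind: "LLM",
      metadata: { evaluator: EVAL_NAME, input, output },
    });
  }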

Step 5: Log Evaluation Results to Phoenix

Finally, we’ll log the evaluation results back to Phoenix so they show up alongside our traces in the UI. This is what makes evaluations useful beyond a single run. Instead of living only in code, results become part of the same view you already use to understand behavior.
  await logSpanAnnotations({ spanAnnotations, sync: true });
  console.log(
    `Logged ${spanAnnotations.length} ${EVAL_NAME} evaluations to Phoenix`,
  );
}

// Run the evaluation workflow defined above.
main();
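Run the script the same way you ran add_traces.ts (for example, npx tsx src/mastra/evals/evals.ts if tsx is how you execute TypeScript in your project).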
Once this completes, head back to Phoenix. You’ll now see evaluation results attached to your trace data in the annotations column, making it easy to understand which runs passed, which failed, and how quality varies across executions.
Congratulations! You’ve run your first evaluation in Phoenix.

Learn More About Evals

Now that you have evaluation results in Phoenix, you can start using them to guide iteration. For example, you can group traces labeled incomplete into a dataset, change your prompts or logic, and then run experiments on the same inputs to compare how the outputs differ. The easiest and fastest way to iterate on your application without writing code is the Prompt Playground; the Iterate on Your Prompts guide walks through that workflow in more detail. To go deeper on evaluations, the Evaluations Tutorial covers writing more nuanced evaluators, using different scoring strategies, and comparing quality across runs as your application evolves. This was a simple example, but Phoenix evaluations support much more advanced workflows as your application grows.