Before We Start
To follow along, you’ll need to have completed Get Started with Tracing, so you should have:
- Financial Analysis and Research Chatbot
- Trace Data in Phoenix
Follow along with code: This guide has a companion codebase with runnable code examples. Find it here.
Step 1: Make Sure You Have Data in Phoenix
Before we can run evaluations, we need something to evaluate. Evaluations in Phoenix run over existing trace data. If you followed the tracing guide, you should already have:
- A project in Phoenix
- Traces containing LLM inputs and outputs
Create a folder in src/mastra called evals to hold the different scripts we will create during this evaluation guide.
The first script we’ll create runs additional queries to generate more trace data in our Phoenix project for evaluation. Before running this file, make sure npm run dev is running in the background.
Create a file called add_traces.ts:
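A minimal sketch of what this file might contain is shown below. It assumes the Mastra dev server started by npm run dev is listening on its default port (4111) and that the chatbot is registered under the agent id financialAgent; the agent id, endpoint, and example queries are placeholders to adapt to your project.

```typescript
// add_traces.ts: a sketch only; adjust the agent id and queries to your project.
// Assumes `npm run dev` is running, so the Mastra dev server is listening on port 4111.

const BASE_URL = "http://localhost:4111"; // default Mastra dev server address (assumption)
const AGENT_ID = "financialAgent";        // hypothetical agent id; replace with yours

// A handful of extra queries to generate more traces in Phoenix.
const queries = [
  "Summarize Apple's most recent quarterly earnings.",
  "How did rising interest rates affect bank stocks this year?",
  "Compare the revenue growth of Microsoft and Google.",
];

async function main() {
  for (const query of queries) {
    // Call the agent through the dev server's REST API so each request is traced.
    const res = await fetch(`${BASE_URL}/api/agents/${AGENT_ID}/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: [{ role: "user", content: query }] }),
    });

    if (!res.ok) {
      console.error(`Request failed for "${query}": ${res.status} ${res.statusText}`);
      continue;
    }

    const data = await res.json();
    console.log(`Q: ${query}\nA: ${data.text ?? JSON.stringify(data)}\n`);
  }
}

main().catch(console.error);
```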
Step 2: Define an Evaluation
Now that we have trace data, the next question is how we decide whether an output is actually good. An evaluation makes that decision explicit. Instead of manually inspecting outputs or relying on intuition, we define a rule that Phoenix can apply consistently across many runs.

In Phoenix, evaluations can be written in different ways. In this guide, we’ll use an LLM-as-a-judge evaluation as a simple starting point. This works well for questions like correctness or relevance, and lets us get metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.)

For LLM-as-a-judge evaluations, that means defining three things:
- A prompt that describes the judgment criteria
- An LLM that performs the evaluation
- The data we want to score
Create a file called evals.ts in src/mastra/evals to hold our evaluation code.
Let’s start by adding our imports and constants at the top of this file. We’ll be using phoenix-evals to create our evaluator and phoenix-client to fetch our traces in code and push our annotations back to the project.
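As a rough sketch, the top of evals.ts might look like the following. The exact import paths and export names can vary between package versions, so check the phoenix-client and phoenix-evals documentation; the Phoenix URL, project name, and judge model are assumptions to replace with your own values.

```typescript
// evals.ts imports and constants (a sketch; confirm exact import paths and exports
// against the phoenix-client / phoenix-evals docs for your installed versions).
import { createClient } from "@arizeai/phoenix-client";
import { createClassifier } from "@arizeai/phoenix-evals/llm"; // assumed export for LLM-as-a-judge classifiers
import { openai } from "@ai-sdk/openai"; // phoenix-evals accepts AI SDK model instances

// Phoenix connection details; adjust to wherever your Phoenix instance is running.
const PHOENIX_BASE_URL = process.env.PHOENIX_BASE_URL ?? "http://localhost:6006";
const PROJECT_NAME = "financial-analysis-chatbot"; // hypothetical; use your project's name

// Client used to fetch spans from the project and push annotations back.
const phoenix = createClient({
  options: { baseUrl: PHOENIX_BASE_URL },
});

// Model that will act as the judge.
const judgeModel = openai("gpt-4o-mini");
```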
Define the Evaluation Prompt
We’ll start by defining the prompt that tells the evaluator how to judge an answer.
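Continuing in evals.ts, a correctness-style prompt might look like the sketch below; the wording and the {input}/{output} placeholder names are illustrative, so match them to whatever variables your evaluator fills in.

```typescript
// A judge prompt describing the criteria. The {input} and {output} placeholders
// are filled in with each trace's question and answer before the judge sees them.
const CORRECTNESS_PROMPT = `
You are evaluating an answer produced by a financial analysis chatbot.

[Question]: {input}
[Answer]: {output}

Decide whether the answer correctly and directly addresses the question.
Respond with a single word: "correct" or "incorrect".
`;
```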
Create the Evaluator
Now we can combine the prompt and model into an evaluator. We’ll wrap our evaluation logic in a main() function to handle async operations.
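Putting those pieces together, a sketch of the evaluator and the main() wrapper might look like this; treat the createClassifier options and the evaluate call shape as assumptions to verify against the phoenix-evals documentation.

```typescript
// Combine the judge model and prompt into a classifier-style evaluator.
// `choices` maps each label the judge may return to a numeric score (shape assumed).
const evaluator = createClassifier({
  model: judgeModel,
  promptTemplate: CORRECTNESS_PROMPT,
  choices: { correct: 1, incorrect: 0 },
});

async function main() {
  // Later steps fetch spans from the project with phoenix-client, score each
  // input/output pair with the evaluator, and log the results back as annotations.
  // For now, sanity-check the evaluator on a single hand-written example.
  const example = {
    input: "How did rising interest rates affect bank stocks this year?",
    output: "Higher rates generally widened net interest margins, lifting bank revenues.",
  };

  const result = await evaluator.evaluate(example); // call shape assumed
  console.log(result); // e.g. { label: "correct", score: 1, explanation: "..." }
}

main().catch(console.error);
```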

