- Python Tutorial: companion Python project with runnable examples
- TypeScript Tutorial: companion TypeScript project with runnable examples
How Custom Evaluators Work
A custom evaluator is defined by a prompt template that guides the judge model through a specific decision. The most effective templates follow the same order the judge reads and reasons about information.

Start by defining the judge's role and task. Rather than asking an open-ended question, the prompt should act like a rubric. It should clearly state what is being evaluated and which criteria the judge should apply. Explicit instructions make judgments easier to reproduce, while vague language leads to inconsistent results.

Next, present the data to be evaluated. In most cases, this includes the input that produced the output and the output itself. Some evaluations require additional context, such as retrieved documents or reference material, but this should be included only when necessary. Clearly labeling each part of the data and using consistent formatting helps reduce ambiguity. Many templates use a delimited section (such as BEGIN DATA / END DATA) to make boundaries explicit.

Finally, constrain the allowed outputs. Most custom evaluators use classification-style outputs that return a single label per example. Labels like correct / incorrect or relevant / irrelevant are easy to compare across runs and integrate cleanly with Phoenix's logging and analysis tools. While other output formats are possible, categorical labels are generally the most stable and interpretable starting point.

Define a Custom Evaluator
The example below shows a customized version of the built-in correctness evaluation, adapted for a travel planning agent. Compared to the generic template, this version encodes application-specific expectations around essential information, budget clarity, and local context. By making these criteria explicit, the resulting evaluation signal is more informative and more useful for identifying concrete areas for improvement.
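A sketch of what such a template might look like in Python. The rubric wording, labels, and variable names here are illustrative assumptions, not the exact built-in template:

```python
# Hypothetical correctness template for a travel planning agent.
# The criteria and labels are illustrative; adapt them to your application.
TRAVEL_CORRECTNESS_TEMPLATE = """
You are evaluating the output of a travel planning assistant.

Decide whether the output correctly answers the user's request. Apply
these criteria:
- Essential information: the flights, lodging, dates, or activities the
  user asked about are actually addressed.
- Budget clarity: any costs are stated clearly and are consistent with
  the user's stated budget.
- Local context: recommendations reflect the destination (seasonality,
  neighborhoods, transit), not generic filler.

[BEGIN DATA]
Input: {input}
Output: {output}
[END DATA]

Respond with exactly one label: "correct" or "incorrect".
"""
```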
Create the Custom Evaluator
Once the template is defined, you can create a custom evaluator using any supported judge model. This example uses a built-in OpenAI model as the judge, but you can substitute any supported judge model.
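A minimal sketch using the phoenix.evals classifier helper, assuming an OpenAI API key is configured in the environment. The evaluator name and score mapping below are our own choices:

```python
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

# Wrap an OpenAI model as the judge; any supported provider works here.
llm = LLM(provider="openai", model="gpt-4o")

# Build a classification evaluator from the custom template.
# "choices" maps each allowed label to a numeric score for analysis.
correctness_evaluator = create_classifier(
    name="travel_correctness",
    prompt_template=TRAVEL_CORRECTNESS_TEMPLATE,
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
```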
Run the Evaluator on Traced Data
Once defined, custom evaluators can be run the same way as built-in templates, either on individual examples or in batch over trace-derived data.

1. Export trace spans

Start by exporting spans from a Phoenix project:
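One way to do this from a running Phoenix instance, assuming a project named travel-agent:

```python
import phoenix as px

# Pull all spans for the project into a pandas dataframe.
client = px.Client()
spans_df = client.get_spans_dataframe(project_name="travel-agent")
```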
The exported dataframe stores data in trace-specific columns such as attributes.input.value and attributes.output.value, while evaluator templates expect fields like {input} and {output}. Input mappings help bridge differences between how data is stored in traces and what evaluators expect, as in the sketch below.
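A minimal version of that bridging step simply renames the trace columns to the field names the template expects:

```python
# Map trace columns to the {input} and {output} fields used by the template.
spans_df = spans_df.rename(
    columns={
        "attributes.input.value": "input",
        "attributes.output.value": "output",
    }
)
```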
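2. Run the evaluator

With the columns mapped, the evaluator can be applied in batch. A sketch using the phoenix.evals batch helper, with variable names carried over from the earlier snippets:

```python
from phoenix.evals import evaluate_dataframe

# Apply the custom evaluator to every row of the span dataframe.
results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[correctness_evaluator],
)
```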
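3. Log the results back to Phoenix

Logging the labels back to Phoenix attaches them to the original traces for analysis. One possible sketch, assuming the results dataframe is indexed by span ID with label and score columns:

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Attach the evaluation labels to their source spans in Phoenix.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="travel_correctness", dataframe=results_df)
)
```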

