- The judge model: the LLM that produces the judgment
- A prompt template or rubric: the criteria used to make that judgment
- Your data: the examples being evaluated
- Python Tutorial: companion Python project with runnable examples
- TypeScript Tutorial: companion TypeScript project with runnable examples
Configure Core LLM Setup
Evals need an LLM to act as the judge: the model that applies the rubric to your data. Configuring that judge is the first step. Phoenix Evals is provider-agnostic, so you can run evaluations using any supported LLM provider without changing how your evaluators are written.

Across both the Python and TypeScript evals libraries, a judge model is represented as a reusable configuration object. This object describes how Phoenix connects to a model provider, including the provider name, model identifier, credentials, and any SDK-specific client configuration. Invocation behavior (temperature, token limits, or other generation controls) is configured separately on the evaluator, which makes it possible to reuse the same judge model across multiple evals while tuning behavior per evaluation. The example below illustrates this separation by configuring a judge model independently of any specific evaluator:
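A minimal sketch in Python, assuming the Phoenix Evals library exposes an `LLM` configuration class with `provider` and `model` parameters; the exact import path and parameter names may differ in your installed version:

```python
# Assumes the Phoenix Evals Python library exposes an `LLM` configuration class;
# check your installed version for the exact import path and parameters.
from phoenix.evals import LLM

# The judge model only describes how Phoenix connects to the provider: provider
# name and model identifier, with credentials read from the environment
# (e.g. OPENAI_API_KEY). Generation controls such as temperature or token
# limits are not set here; they are configured on the evaluator.
judge_model = LLM(provider="openai", model="gpt-4o-mini")
```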
Built-In Eval Templates in Phoenix
Phoenix includes a set of built-in eval templates that cover common evaluation tasks such as relevance, correctness, faithfulness, summarization quality, and toxicity. These templates encode a predefined rubric, structured outputs, and defaults that work well for LLM-as-a-judge workflows. You can find all built-in templates in the Phoenix documentation. Built-in templates are a good choice when you want reliable signal quickly without designing a rubric from scratch, especially early in iteration or when establishing a baseline. The example below shows a minimal setup using the built-in Correctness eval template with a configured judge model:
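A sketch of that minimal setup in Python; the `CorrectnessEvaluator` import path and class name below are assumptions, so consult the built-in template reference for the exact identifiers in your version:

```python
from phoenix.evals import LLM

# Hypothetical import for the built-in Correctness template; the actual module
# path and class name may differ between Phoenix Evals releases.
from phoenix.evals.metrics import CorrectnessEvaluator

# Reusable judge-model configuration (connection details only).
judge_model = LLM(provider="openai", model="gpt-4o-mini")

# The built-in template supplies the rubric, structured output, and sensible
# defaults; you only provide the judge model.
correctness_eval = CorrectnessEvaluator(llm=judge_model)
```

The same judge model object can be reused with other built-in or custom evaluators, which is the point of keeping connection configuration separate from invocation behavior.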
Running Evals on Phoenix Traces
With a judge model and evaluator defined, the next step is running evals on real application data. A common workflow is evaluating traced executions and attaching results back to spans in Phoenix. Once attached, you can inspect failures and edge cases in the UI, compare behavior across runs, and use eval results as inputs to datasets and experiments.
1. Export trace spans
Start by exporting spans from a Phoenix project into a tabular structure:
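In Python this can be done with the Phoenix client's `get_spans_dataframe` helper (a sketch; the project name "default" is an assumption):

```python
import phoenix as px

# Connect to a running Phoenix instance (via PHOENIX_COLLECTOR_ENDPOINT or the
# default localhost address) and export spans as a pandas DataFrame.
client = px.Client()
spans_df = client.get_spans_dataframe(project_name="default")

# Prompt and response text arrive in flattened attribute columns,
# e.g. attributes.input.value and attributes.output.value.
print(spans_df.columns)
```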
Exported spans store prompt and response text in flattened attribute columns such as attributes.input.value and attributes.output.value, while evaluator templates expect their own field names. Input mappings help bridge these differences between how data is stored in traces and what evaluators expect.
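One portable way to do that in Python is to rename the exported columns to the field names the evaluator's template expects; depending on version, the evals library may also accept an input mapping argument directly on the evaluator. A sketch, where "input" and "output" are assumed field names:

```python
# Bridge the gap between trace storage and evaluator inputs by renaming the
# flattened span columns to the field names the template expects ("input" and
# "output" are assumptions; match them to your evaluator's template variables).
eval_df = spans_df.rename(
    columns={
        "attributes.input.value": "input",
        "attributes.output.value": "output",
    }
)
```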
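With the columns mapped, run the evaluator over every exported row. A sketch assuming an `evaluate_dataframe` helper that takes a DataFrame and a list of evaluators; the exact entry point may differ by version:

```python
# Assumed helper name; some versions expose a different entry point for
# running evaluators over a DataFrame.
from phoenix.evals import evaluate_dataframe

# Each row is judged by the correctness evaluator; labels, scores, and
# explanations come back as additional columns.
results_df = evaluate_dataframe(
    dataframe=eval_df,
    evaluators=[correctness_eval],
)
```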
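Finally, attach the results back to the original spans so they appear alongside the traces in the Phoenix UI. A sketch using the Python client's `log_evaluations` with `SpanEvaluations`; it assumes the results DataFrame keeps the `context.span_id` index from the export and contains label, score, and explanation columns:

```python
from phoenix.trace import SpanEvaluations

# SpanEvaluations associates each row with a span via the context.span_id index;
# eval_name ("Correctness" here) is how the results are grouped in the UI.
client.log_evaluations(
    SpanEvaluations(eval_name="Correctness", dataframe=results_df)
)
```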
With built-in evals running on traced data, you can now:
- Inspect failures and edge cases
- Compare behavior across runs
- Use eval results as inputs to datasets and experiments

