- The judge model: the LLM that produces the judgment
- A prompt template or rubric: the criteria used to make that judgment
- Your data: the examples being evaluated
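A minimal sketch of how these three pieces fit together in Python, assuming the classic phoenix.evals llm_classify workflow; the rubric wording, model name, and example rows are illustrative placeholders rather than a prescribed setup:

```python
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

# 1. The judge model: the LLM that produces the judgment.
judge_model = OpenAIModel(model="gpt-4o")

# 2. The rubric: a prompt template describing the judgment criteria.
#    The {input} and {output} placeholders are filled from the data below.
rubric = """
You are judging whether a response answers the user's question.
[Question]: {input}
[Response]: {output}
Answer with a single word: "answered" or "unanswered".
"""

# 3. The data: the examples being evaluated.
examples = pd.DataFrame(
    {
        "input": ["What is the boiling point of water at sea level?"],
        "output": ["Water boils at 100 °C (212 °F) at sea level."],
    }
)

# The judge applies the rubric to each row and returns one label per example.
results = llm_classify(
    examples,
    model=judge_model,
    template=rubric,
    rails=["answered", "unanswered"],
)
print(results["label"])
```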
- Python Tutorial: companion Python project with runnable examples
- TypeScript Tutorial: companion TypeScript project with runnable examples
Using Custom or OpenAI-Compatible Judge Models
In addition to standard hosted providers, Phoenix supports custom or self-hosted judge models that are compatible with an existing provider SDK, such as OpenAI-compatible APIs. This allows you to run LLM-as-a-judge evaluations against internal inference services, private deployments, or alternative model hosts while continuing to use the same evaluation templates and execution workflows. When configuring a judge model, you can pass any SDK-specific parameters required to reach your endpoint, such as base_url, api_key, or api_version. These settings control how Phoenix connects to and authenticates with the model provider.
The same separation of responsibilities applies regardless of where the model is hosted:
- Connectivity and authentication are defined on the judge model
- Evaluation behavior (for example, temperature or token limits) is controlled by the evaluator
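For instance, a judge model pointed at an OpenAI-compatible endpoint might be configured as in the sketch below, which assumes the classic phoenix.evals OpenAIModel wrapper; the endpoint URL, API key, and model name are placeholders for your own deployment:

```python
from phoenix.evals import OpenAIModel

# Connectivity and authentication are defined on the judge model itself.
# The endpoint URL, API key, and model name below are placeholders.
judge_model = OpenAIModel(
    model="my-internal-model",
    base_url="https://llm.internal.example.com/v1",  # OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)
```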
Built-In Eval Templates in Phoenix
Phoenix includes a set of built-in eval templates that cover common evaluation tasks such as relevance, correctness, faithfulness, summarization quality, and toxicity. These templates encode a predefined rubric, structured outputs, and defaults that work well for LLM-as-a-judge workflows. The full set of built-in templates is listed in the Phoenix documentation. Built-in templates are a good choice when you want reliable signal quickly without designing a rubric from scratch, especially early in iteration or when establishing a baseline. The example below shows a minimal setup using the built-in Correctness eval template with a configured judge model.
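The sketch below uses the classic phoenix.evals interface, with the Q&A correctness template (QA_PROMPT_TEMPLATE) standing in for the built-in Correctness eval; the dataset, column values, and judge model name are placeholders:

```python
import pandas as pd

from phoenix.evals import (
    OpenAIModel,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    llm_classify,
)

# The Q&A correctness template expects "input" (question), "reference"
# (context), and "output" (answer) columns in the evaluated data.
data = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

# The configured judge model; connectivity and authentication live here.
judge_model = OpenAIModel(model="gpt-4o")

# Run the eval; rails constrain the judge to the template's allowed labels.
results = llm_classify(
    data,
    model=judge_model,
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```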

