Phoenix evaluators accept prompt templates in three formats: plain strings, message lists, and structured content parts (Python only). Each format works with all supported models and providers.

Supported Formats

1. String Prompts

Simple string templates with variable placeholders.

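Python

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment: {text}",
    choices={"positive": 1.0, "negative": 0.0, "neutral": 0.5}
)

TypeScript
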
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "sentiment",
  model,
  promptTemplate: "Classify the sentiment: {{text}}",
  choices: { positive: 1, negative: 0, neutral: 0.5 },
});

2. Message Lists

Arrays of message objects with role and content fields.

Python

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices={"helpful": 1.0, "somewhat_helpful": 0.5, "not_helpful": 0.0}
)

TypeScript

import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "helpfulness",
  model,
  promptTemplate: [
    { role: "system", content: "Evaluate the answer helpfulness." },
    { role: "user", content: "Question: {{question}}\nAnswer: {{answer}}" },
  ],
  choices: { helpful: 1, somewhat_helpful: 0.5, not_helpful: 0 },
});

Supported roles:

"system" - Instructions for the model.
"user" - User messages and input context.
"assistant" - Assistant/model responses (for multi-turn conversations or few-shot examples).
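
The assistant role is what makes few-shot prompting possible: each labeled example becomes a user/assistant pair placed ahead of the real input. The following is a minimal sketch in plain Python. The helper name build_few_shot_messages and the example data are hypothetical and not part of Phoenix; the result is simply a list in the message format described above.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_few_shot_messages(
    instructions: str,
    examples: List[Tuple[str, str]],
    final_user_template: str,
) -> List[Message]:
    """Build a message list: system, then user/assistant example pairs, then the real input."""
    messages: List[Message] = [{"role": "system", "content": instructions}]
    for example_input, example_label in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_label})
    # The final user message keeps its placeholders so they are filled at evaluation time.
    messages.append({"role": "user", "content": final_user_template})
    return messages


few_shot_template = build_few_shot_messages(
    instructions="Evaluate the answer helpfulness.",
    examples=[
        (
            "Question: How do I reset my password?\n"
            "Answer: Use the reset link on the login page.",
            "helpful",
        ),
        (
            "Question: How do I reset my password?\n"
            "Answer: Passwords are important.",
            "not_helpful",
        ),
    ],
    final_user_template="Question: {question}\nAnswer: {answer}",
)
```

A list built this way could be passed as the prompt template of an evaluator. Note that with f-string syntax any literal braces inside the example text would be read as placeholders, so this sketch assumes the examples contain none.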

3. Structured Content Parts (Python only)

Messages with multiple content parts, useful for separating different pieces of context.

Python

Only text content is supported at this time.

evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "text", "text": "Answer: {answer}"}
            ]
        }
    ],
    choices={"relevant": 1.0, "not_relevant": 0.0}
)

Template Variables

All formats support variable substitution. Python supports both f-string ({variable}) and mustache ({{variable}}) syntax, while TypeScript supports mustache syntax only.

Python

# Variables are provided when calling .evaluate()
result = evaluator.evaluate({
    "question": "What is Python?",
    "answer": "A programming language"
})

TypeScript

// Variables are provided when calling .evaluate()
const result = await evaluator.evaluate({
  question: "What is Python?",
  answer: "A programming language",
});

console.log(result.label); // e.g., "relevant"
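
To make the difference between the two placeholder syntaxes concrete, here is a small stand-alone sketch of variable substitution. It is an illustration only and not Phoenix's implementation: render_f_string and render_mustache are hypothetical helpers, and the mustache version handles only simple {{name}} placeholders.

```python
import re
from typing import Any, Mapping

# Matches {{name}} with optional whitespace inside the braces.
MUSTACHE_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_f_string(template: str, variables: Mapping[str, Any]) -> str:
    """Fill {variable} placeholders using Python's built-in string formatting."""
    return template.format(**variables)


def render_mustache(template: str, variables: Mapping[str, Any]) -> str:
    """Fill {{variable}} placeholders; single braces are left untouched."""
    return MUSTACHE_PATTERN.sub(lambda match: str(variables[match.group(1)]), template)


variables = {"question": "What is Python?", "answer": "A programming language"}

f_string_prompt = render_f_string("Question: {question}\nAnswer: {answer}", variables)
mustache_prompt = render_mustache("Question: {{question}}\nAnswer: {{answer}}", variables)

# Both syntaxes produce the same rendered prompt for simple variables.
assert f_string_prompt == mustache_prompt

# Mustache placeholders coexist with literal single braces, such as inline JSON.
json_prompt = render_mustache('Reply like {"ok": true} for: {{question}}', variables)
```

One practical consequence: when a prompt contains literal braces, for example a JSON snippet, mustache syntax avoids having those braces interpreted as placeholders.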

Writing a prompt template

A successful judge prompt template has four elements:

Define the judge’s role

In the first part of your prompt, define the judge’s role. Avoid framing like “you are an expert evaluator”: it rarely helps and can sometimes make results worse. Instead, focus on giving the judge context: what type of system it is evaluating, what industry or domain that system operates in, and what the judge’s task is. For example, telling the judge “you are identifying issues with the relevance of an agent’s responses so we can improve the experience for our users” establishes the system under evaluation, the quality dimension you care about, and the goal of the evaluation.

Explicit criteria

Avoid ambiguous or aspirational instructions like “a good response” or “a helpful answer”. Focus on explicit instructions: what specific elements of a response would make it helpful? For example, for a financial agent, one criterion might be “Contains a specific buy/sell/hold recommendation”, or for a customer service agent it might be “mentions specific actions to take in the UI to resolve the issue”. Also include criteria for failure: what would make the response not helpful? These are often drawn from inspecting traces.

Be careful not to over-specify. Modern LLMs follow instructions very closely, so a long list of rigid rules can constrain the judge in ways you don’t intend. A criterion like “must contain a specific buy/sell/hold recommendation” may be too strict compared to a more open-ended goal like “consider whether the response provides an appropriate next step when the user asks for advice on whether to buy, sell, or hold an asset”, especially when the judge already has the context that it is evaluating a system inside a financial institution.

Include labeled data

Reference each input with a placeholder that is filled in at runtime, for example {input} and {output}, or {question} and {answer} (see Template Variables above). Surround each placeholder with a clear label so the LLM can tell where your instructions end and where each input or output begins and ends. XML tags work well for this:

<user_query>
{input}
</user_query>

<financial_report>
{output}
</financial_report>
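
Putting the first three elements together, a judge template might look like the sketch below. The wording of the role and criteria, the ACME example, and the constant name JUDGE_TEMPLATE are all hypothetical; the rendering uses plain str.format to show what the LLM would see, not Phoenix's own rendering code.

```python
# A judge template with context about the role, open-ended criteria,
# and XML-tagged placeholders for the runtime inputs.
JUDGE_TEMPLATE = """\
You are reviewing responses from a financial research assistant used by advisors
at a bank. Your task is to identify responses that fail to address the user's
query so the team can improve the assistant.

Consider whether the report answers the question that was asked and whether it
offers an appropriate next step when the user asks for advice.

<user_query>
{input}
</user_query>

<financial_report>
{output}
</financial_report>
"""

# Render with sample values to inspect the final prompt text.
rendered = JUDGE_TEMPLATE.format(
    input="Should I hold my position in ACME?",
    output="ACME's revenue grew last quarter; holding is reasonable for now.",
)
print(rendered)
```

The template deliberately says nothing about labels or response format, which are supplied separately through the evaluator's choices.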

Don’t specify the output format

You don’t need to tell the LLM what labels to output or describe a response format in your prompt. You define the possible responses externally, as the evaluator’s choices (see above), and Phoenix builds an output schema from those choices so that responses are easy to parse. Phoenix uses the model’s structured output mode where it is available, or otherwise defines a single tool and requires the model to call it. In all cases Phoenix ensures the LLM responds with one of your choices, so your prompt can focus on the evaluation criteria rather than the output format.
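
As a rough illustration of how an output schema can be derived from choices, the sketch below builds a JSON Schema whose label field is restricted to the configured labels, then maps a parsed label back to its numeric value. This is an assumption-laden sketch, not Phoenix's actual schema or parsing code: build_label_schema and parse_response are hypothetical names, and the real schema may contain additional fields.

```python
import json
from typing import Any, Dict

choices = {"helpful": 1.0, "somewhat_helpful": 0.5, "not_helpful": 0.0}


def build_label_schema(choices: Dict[str, float]) -> Dict[str, Any]:
    """Build a JSON Schema that only admits one of the configured labels."""
    return {
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": list(choices)},
        },
        "required": ["label"],
        "additionalProperties": False,
    }


def parse_response(raw_json: str, choices: Dict[str, float]) -> Dict[str, Any]:
    """Parse a structured response and look up the numeric value for its label."""
    label = json.loads(raw_json)["label"]
    if label not in choices:
        raise ValueError(f"Unexpected label: {label!r}")
    return {"label": label, "score": choices[label]}


schema = build_label_schema(choices)
parsed = parse_response('{"label": "somewhat_helpful"}', choices)
```

Because the set of labels is enforced by the schema (or by a required tool call), repeating it in the prompt text adds nothing and risks drifting out of sync with the choices.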

Using Phoenix Prompt Versions as Eval Templates (Python)

If your prompt is already stored in Phoenix Prompt Management, you can convert it directly into an evals PromptTemplate with phoenix_prompt_to_prompt_template.

from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,
    LLM,
    phoenix_prompt_to_prompt_template,
)

client = Client(base_url="http://localhost:6006")
prompt_version = client.prompts.get(prompt_identifier="test-prompt")

prompt_template = phoenix_prompt_to_prompt_template(prompt_version)

evaluator = ClassificationEvaluator(
    name="recipe_quality",
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    prompt_template=prompt_template,
    choices={"good": 1.0, "bad": 0.0},
)

Notes:

This utility accepts either a Phoenix PromptVersion object or a PromptVersionData-like dictionary.
Role normalization supports Phoenix role aliases (ai/model -> assistant, developer -> system), including mixed-case role names.
For structured content parts, only text parts are currently supported ({"type": "text", "text": ...}).
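
The role normalization described in the notes can be pictured with the following sketch. It is not the library's code: normalize_role and ROLE_ALIASES are hypothetical names, and raising an error for unknown roles is an assumption about how such a helper might behave.

```python
# Alias table taken from the note above: ai/model -> assistant, developer -> system.
ROLE_ALIASES = {
    "ai": "assistant",
    "model": "assistant",
    "developer": "system",
}
CANONICAL_ROLES = {"system", "user", "assistant"}


def normalize_role(role: str) -> str:
    """Map a role name to system, user, or assistant, ignoring case."""
    lowered = role.lower()
    canonical = ROLE_ALIASES.get(lowered, lowered)
    if canonical not in CANONICAL_ROLES:
        # Assumption: unknown roles are rejected rather than passed through.
        raise ValueError(f"Unsupported role: {role!r}")
    return canonical


normalized = [normalize_role(role) for role in ["AI", "Model", "Developer", "user"]]
```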

Client-Specific Behavior

Python

All clients accept the same message format as input. Adapters handle client-specific transformations internally as needed:

OpenAI

System role is converted to developer role for reasoning models.
Otherwise, messages are passed as-is.
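
A minimal sketch of the OpenAI transformation, assuming the caller already knows whether the target is a reasoning model. The function adapt_for_openai and its is_reasoning_model flag are hypothetical; how the real adapter detects reasoning models is not covered here.

```python
from typing import Dict, List

Message = Dict[str, str]


def adapt_for_openai(messages: List[Message], is_reasoning_model: bool) -> List[Message]:
    """Rename system to developer for reasoning models; otherwise copy messages unchanged."""
    adapted: List[Message] = []
    for message in messages:
        role = message["role"]
        if is_reasoning_model and role == "system":
            role = "developer"
        # Copy each message so the caller's list is not mutated.
        adapted.append({**message, "role": role})
    return adapted


messages = [
    {"role": "system", "content": "Evaluate the answer helpfulness."},
    {"role": "user", "content": "Question: What is Python?\nAnswer: A programming language"},
]
reasoning_messages = adapt_for_openai(messages, is_reasoning_model=True)
standard_messages = adapt_for_openai(messages, is_reasoning_model=False)
```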

Anthropic

System messages are extracted and passed via the system parameter.
User and assistant messages are sent in the messages array.
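
The Anthropic transformation can be sketched as a split of the message list into a system string and a messages array. The function adapt_for_anthropic is hypothetical, and joining multiple system messages with a blank line is an assumption made for this example.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def adapt_for_anthropic(messages: List[Message]) -> Dict[str, Any]:
    """Separate system messages from the user/assistant conversation."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    conversation = [dict(m) for m in messages if m["role"] != "system"]
    return {
        # Assumption: several system messages are merged into one string.
        "system": "\n\n".join(system_parts),
        "messages": conversation,
    }


request_kwargs = adapt_for_anthropic(
    [
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: What is Python?\nAnswer: A programming language"},
    ]
)
```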

Google GenAI

System messages are extracted and passed via system_instruction in the config.
The assistant role is converted to the model role.
Messages are sent in the contents array.
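
The three Google GenAI steps can be sketched together as follows. The function adapt_for_google_genai and the shape of its return value are illustrative assumptions; the example uses plain dictionaries rather than the SDK's own types.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def adapt_for_google_genai(messages: List[Message]) -> Dict[str, Any]:
    """Extract system text, rename assistant to model, and build a contents array."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    contents = [
        {
            "role": "model" if m["role"] == "assistant" else m["role"],
            "parts": [{"text": m["content"]}],
        }
        for m in messages
        if m["role"] != "system"
    ]
    return {
        # Assumption: several system messages are merged into one instruction.
        "config": {"system_instruction": "\n\n".join(system_parts)},
        "contents": contents,
    }


google_request = adapt_for_google_genai(
    [
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: 2 + 2?\nAnswer: 4"},
        {"role": "assistant", "content": "helpful"},
        {"role": "user", "content": "Question: What is Python?\nAnswer: A programming language"},
    ]
)
```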

LiteLLM

Messages are passed directly to LiteLLM in OpenAI format.
LiteLLM handles provider-specific conversions internally.

LangChain

OpenAI-format messages are converted to LangChain message objects (HumanMessage, AIMessage, SystemMessage).

Full Example

A complete example showing evaluator setup and usage:

Python

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You evaluate response helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices={"helpful": 1.0, "somewhat_helpful": 0.5, "not_helpful": 0.0}
)

result = evaluator.evaluate({
    "question": "How do I learn Python?",
    "answer": "Start with online tutorials and practice daily."
})

print(result[0].label)  # e.g., "helpful"

TypeScript

import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const evaluator = createClassificationEvaluator({
  name: "helpfulness",
  model,
  promptTemplate: [
    { role: "system", content: "You evaluate response helpfulness." },
    { role: "user", content: "Question: {{question}}\nAnswer: {{answer}}" },
  ],
  choices: { helpful: 1, somewhat_helpful: 0.5, not_helpful: 0 },
});

const result = await evaluator.evaluate({
  question: "How do I learn Python?",
  answer: "Start with online tutorials and practice daily.",
});

console.log(result.label); // e.g., "helpful"
