Phoenix evaluators accept prompt templates in several formats; every format works with all supported models and providers.

Supported Formats

1. String Prompts

Simple string templates with variable placeholders.
evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment: {text}",
    choices=["positive", "negative", "neutral"]
)

2. Message Lists

Arrays of message objects with role and content fields.
evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "Evaluate the answer helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)
Supported roles:
  • "system" - Instructions for the model.
  • "user" - User messages and input context.
  • "assistant" - Assistant/model responses (for multi-turn conversations or few-shot examples).
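
As a sketch of the assistant role in practice, a hypothetical few-shot template might interleave user/assistant turns as worked examples before the final templated input (the example texts and labels below are illustrative, not from the library):

```python
# Hypothetical few-shot template: assistant turns carry example labels,
# and the final user turn holds the placeholder filled in at evaluate() time.
few_shot_template = [
    {"role": "system", "content": "Classify the sentiment of the text."},
    {"role": "user", "content": "Text: I love this product!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Text: This is the worst."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Text: {text}"},
]
```

This list would be passed as prompt_template= exactly like the two-message example above.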

3. Structured Content Parts (Python only)

Messages with multiple content parts, useful for separating different pieces of context.
Only text content is supported at this time.
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Question: {question}"},
                {"type": "text", "text": "Answer: {answer}"}
            ]
        }
    ],
    choices=["relevant", "not_relevant"]
)

Template Variables

All formats support variable substitution. Python supports both f-string ({variable}) and mustache ({{variable}}) syntax, while TypeScript supports mustache syntax only.
# Variables are provided when calling .evaluate()
result = evaluator.evaluate({
    "question": "What is Python?",
    "answer": "A programming language"
})
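
To illustrate the two placeholder syntaxes, here is a minimal, library-independent sketch showing that an f-string-style template and a mustache-style template can render to the same prompt (this mimics the documented substitution behavior; it is not Phoenix's internal implementation):

```python
import re

variables = {"question": "What is Python?", "answer": "A programming language"}

def render_fstring(template: str, variables: dict) -> str:
    # f-string style placeholders: {variable}
    return template.format(**variables)

def render_mustache(template: str, variables: dict) -> str:
    # mustache style placeholders: {{variable}}
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)

prompt_f = render_fstring("Question: {question}\nAnswer: {answer}", variables)
prompt_m = render_mustache("Question: {{question}}\nAnswer: {{answer}}", variables)
assert prompt_f == prompt_m  # both syntaxes yield the same rendered prompt
```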

Using Phoenix Prompt Versions as Eval Templates (Python)

If your prompt is already stored in Phoenix Prompt Management, you can convert it directly into an evals PromptTemplate with phoenix_prompt_to_prompt_template.
from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,
    LLM,
    phoenix_prompt_to_prompt_template,
)

client = Client(base_url="http://localhost:6006")
prompt_version = client.prompts.get(prompt_identifier="test-prompt")

prompt_template = phoenix_prompt_to_prompt_template(prompt_version)

evaluator = ClassificationEvaluator(
    name="recipe_quality",
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    prompt_template=prompt_template,
    choices=["good", "bad"],
)
Notes:
  • This utility accepts either a Phoenix PromptVersion object or a PromptVersionData-like dictionary.
  • Role normalization supports Phoenix role aliases (ai/model -> assistant, developer -> system), including mixed-case role names.
  • For structured content parts, only text parts are currently supported ({"type": "text", "text": ...}).
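
The role normalization described above can be sketched as a case-insensitive alias map (an illustration of the documented behavior, not the utility's actual implementation):

```python
# Case-insensitive aliases: ai/model -> assistant, developer -> system.
ROLE_ALIASES = {"ai": "assistant", "model": "assistant", "developer": "system"}

def normalize_role(role: str) -> str:
    lowered = role.lower()
    return ROLE_ALIASES.get(lowered, lowered)

assert normalize_role("AI") == "assistant"
assert normalize_role("Developer") == "system"
assert normalize_role("user") == "user"  # standard roles pass through
```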

Client-Specific Behavior

All clients accept the same message format as input. Adapters handle client-specific transformations internally as needed:

OpenAI

  • System role is converted to developer role for reasoning models.
  • Otherwise, messages are passed as-is.

Anthropic

  • System messages are extracted and passed via the system parameter.
  • User/assistant messages are sent in the messages array.

Google GenAI

  • System messages are extracted and passed via system_instruction in the config.
  • The assistant role is converted to the model role.
  • Messages are sent in the contents array.

LiteLLM

  • Messages are passed directly to LiteLLM in OpenAI format.
  • LiteLLM handles provider-specific conversions internally.

LangChain

  • OpenAI-format messages are converted to LangChain message objects (HumanMessage, AIMessage, SystemMessage).
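
As an illustration of the adapter transformations described above (a sketch of the documented behavior, not the actual adapter code), system-message extraction and role mapping might look like:

```python
def split_system(messages):
    """Anthropic-style: pull system content out of the messages array."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n".join(system_parts), rest

def to_genai_roles(messages):
    """Google GenAI-style: assistant turns become 'model' turns."""
    return [
        {**m, "role": "model" if m["role"] == "assistant" else m["role"]}
        for m in messages
    ]

messages = [
    {"role": "system", "content": "Evaluate helpfulness."},
    {"role": "user", "content": "Question: What is Python?"},
    {"role": "assistant", "content": "helpful"},
]
system, rest = split_system(messages)
# system holds the extracted instructions; rest keeps the user/assistant turns
```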

Full Example

A complete example showing evaluator setup and usage:
from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="helpfulness",
    llm=llm,
    prompt_template=[
        {"role": "system", "content": "You evaluate response helpfulness."},
        {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
    ],
    choices=["helpful", "somewhat_helpful", "not_helpful"]
)

result = evaluator.evaluate({
    "question": "How do I learn Python?",
    "answer": "Start with online tutorials and practice daily."
})

print(result[0].label)  # e.g., "helpful"