Skip to main content
Evaluations measure the quality of your AI application’s outputs — whether responses are accurate, grounded, safe, or relevant to the user’s intent. Unlike traditional software, LLM outputs can’t be tested with simple assertions. Evaluations give you a systematic way to catch regressions, compare model or prompt changes, and build confidence before shipping. Phoenix supports both deterministic code-based evaluators (exact match, regex, custom heuristics) and LLM-as-a-judge evaluators, where a second model scores the output against a rubric. You can run evaluations on traces from production, on experiment results, or on any dataset. Phoenix provides two approaches:

Two Ways to Evaluate

Features

  • Model Agnostic via adapters (for OpenAI, LiteLLM, LangChain, AI SDK, and more) — so you can use easily switch judge models, or stick to your preferred provider.
  • Powerful input mapping system for working with complex data structures — easily map nested data and complex inputs to evaluator requirements.
  • Pre-built metrics for common evaluation tasks and use cases like RAG and tool-calling agents.
  • Evaluators are natively instrumented via OpenTelemetry tracing for observability and dataset curation.
  • Blazing fast performance — achieve up to 20x speedup with built-in concurrency and batching.
  • Built-in Explanations — all Phoenix LLM evaluations return explanations by default for better results and richer signals.

Structured Output via Tool Calling

LLM evaluators use function calling (tool use) to extract structured judgments rather than parsing freeform text. Phoenix generates a tool from the evaluator’s output config — for example:
{
  "name": "correctness",
  "parameters": {
    "type": "object",
    "properties": {
      "label": {
        "type": "string",
        "enum": ["correct", "incorrect"]
      },
      "explanation": {
        "type": "string"
      }
    },
    "required": ["label", "explanation"]
  }
}
The LLM is required to call this tool rather than respond with freeform text. Phoenix parses the tool call and maps the returned label to its numeric score. This applies to both the client-side SDK and server-side evaluators.

Executors

Under the hood, the Phoenix uses executors to run evaluations faster and more reliably. Phoenix automatically handles the infrastructure complexity:
  • Rate limit handling: automatically retries when LLM providers throttle requests
  • Error management: distinguishes between temporary failures and permanent errors so retries don’t waste API budget
  • Dynamic concurrency: adjusts parallelism based on provider performance to maximize throughput without triggering rate limits
This means you can run thousands of evaluations without writing any retry or concurrency logic yourself.

Evaluator Tracing

Evaluator traces page
Evaluator trace detail
All evaluator runs are automatically traced via OpenTelemetry and sent to a dedicated Phoenix project. This gives you complete transparency into how your evaluators make decisions — essential for validating prompt engineering and achieving human alignment. Every evaluation execution captures the input data, the exact prompts sent to the judge LLM, the model’s full reasoning, the final scores, and execution timing. Use Phoenix’s trace viewer to explore evaluation traces, identify systematic biases, and continuously improve evaluator performance.
For continuous monitoring of application performance — evals on production traffic with alerting and threshold-based triggers — see Arize AX Online Evals.