Phoenix provides pre-built evaluation metrics that can be used out of the box to assess LLM application quality. These metrics are available in both Python and TypeScript and are designed to work seamlessly with Phoenix’s tracing and experiment infrastructure. All LLM evaluation templates are tested against golden datasets and achieve an F1 score of 85% or higher on benchmarks.
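For instance, an evaluator can be attached to an experiment run. The snippet below is a minimal sketch rather than a definitive recipe: it assumes a running Phoenix server, a dataset already uploaded under the name "my-dataset", and expected outputs that contain an "answer" field; the task function is a placeholder for your application.

```python
import phoenix as px
from phoenix.experiments import run_experiment

# Assumes a running Phoenix server and an existing dataset named "my-dataset".
dataset = px.Client().get_dataset(name="my-dataset")

def task(input):
    # Placeholder for your application; replace with a real LLM call.
    return "stub answer"

def exact_match(output, expected):
    # A minimal code evaluator: 1.0 if the output equals the expected answer.
    return float(output == expected["answer"])

experiment = run_experiment(dataset, task, evaluators=[exact_match])
```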

LLM Evaluators

LLM evaluators use a judge model to assess the quality of outputs. These are useful for subjective or nuanced evaluations where simple rules don’t suffice.
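Under the hood these follow the LLM-as-a-judge pattern: the input, context, and output are rendered into a prompt template, and the judge model returns one of a fixed set of labels. A minimal sketch using the `llm_classify` helper from `phoenix.evals` with its built-in hallucination template (the example data and model choice are arbitrary, and the class names of the individual evaluators listed below depend on your package version):

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per example; column names must match the template variables.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```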

Faithfulness

Measures whether a response is faithful to (grounded in) the provided context. Detects hallucinations and unsupported claims.

Conciseness

Evaluates whether a response is concise and free of unnecessary content like filler, hedging, and meta-commentary.

Correctness

Evaluates the general correctness of an LLM response.

Document Relevance

Assesses whether retrieved documents are relevant to the input query. Useful for RAG evaluation.

Tool Selection

Determines whether the correct tool was selected for a given context from the available options.

Tool Invocation

Checks if a tool was invoked correctly with proper arguments, formatting, and safe content.

Tool Response Handling

Evaluates whether an agent correctly processed a tool’s result, including error handling, data extraction, and safe information disclosure.

Refusal

Detects when an LLM refuses, declines, or avoids answering a user query.

Code Evaluators

Code evaluators use deterministic logic for evaluation. These are faster, cheaper, and provide consistent results for objective criteria.

Exact Match

Checks if the output exactly matches an expected value. Supports optional normalization.
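The logic is roughly as follows (a plain-Python sketch, with whitespace trimming and lowercasing standing in for the optional normalization step):

```python
def exact_match(output: str, expected: str, normalize: bool = True) -> float:
    # Optionally normalize before comparing: trim whitespace and lowercase.
    if normalize:
        output, expected = output.strip().lower(), expected.strip().lower()
    return float(output == expected)

assert exact_match("  Paris ", "paris") == 1.0
assert exact_match("Paris", "paris", normalize=False) == 0.0
```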

Matches Regex

Validates that output matches a specified regular expression pattern.
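Conceptually (a plain-Python sketch; the pattern shown is only an example):

```python
import re

def matches_regex(output: str, pattern: str) -> float:
    # Score 1.0 if the pattern is found anywhere in the output.
    return float(re.search(pattern, output) is not None)

# e.g., check that the output contains an ISO-8601 date
assert matches_regex("Shipped on 2024-05-01.", r"\d{4}-\d{2}-\d{2}") == 1.0
```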

Precision / Recall / F-Score

Computes precision, recall, and F1 scores for comparing predicted vs actual values.
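For binary labels the definitions reduce to the following (a plain-Python sketch; libraries such as scikit-learn compute the same quantities):

```python
def precision_recall_f1(predicted: list[bool], actual: list[bool]) -> tuple[float, float, float]:
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([True, True, False, True], [True, False, False, True]))
# (0.666..., 1.0, 0.8)
```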

Legacy Evaluators

Legacy evaluators are template-based evaluators from earlier versions of Phoenix. They remain available for backward compatibility, but we recommend the modern evaluators above for new projects.
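These ship as evaluator classes in `phoenix.evals` and can be run in bulk with `run_evals`. A minimal sketch using the hallucination and Q&A evaluators (the example data and model choice are arbitrary; column names follow the legacy templates):

```python
import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

model = OpenAIModel(model="gpt-4o-mini")
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    provide_explanation=True,
)
```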

Q&A Evaluation

Evaluates Q&A correctness using legacy templates.

Retrieval / RAG Relevance

Legacy document relevance evaluation for RAG systems.

Summarization

Evaluates summary quality using legacy templates.

Toxicity

Legacy toxicity detection evaluation.

SQL Generation

Evaluates SQL query correctness using legacy templates.

Tool Calling (Legacy)

Legacy tool calling evaluation. Consider using Tool Invocation and Tool Selection instead.

Looking to create custom evaluators? See the Building Custom Evaluators guide.