Phoenix provides pre-built evaluation metrics that can be used out of the box to assess LLM application quality. These metrics are available in both Python and TypeScript and are designed to work seamlessly with Phoenix’s tracing and experiment infrastructure. All LLM evaluation templates are tested against golden datasets and achieve an F1 score of 85% or higher on benchmarks.
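For instance, an evaluator can be attached to an experiment run. The snippet below is a minimal sketch rather than a definitive recipe: it assumes a running Phoenix server, a dataset already uploaded under the name "my-dataset", and expected outputs that contain an "answer" field; the task function is a placeholder for your application.

```python
import phoenix as px
from phoenix.experiments import run_experiment

# Assumes a running Phoenix server and an existing dataset named "my-dataset".
dataset = px.Client().get_dataset(name="my-dataset")

def task(input):
    # Placeholder for your application; replace with a real LLM call.
    return "stub answer"

def exact_match(output, expected):
    # A minimal code evaluator: 1.0 if the output equals the expected answer.
    return float(output == expected["answer"])

experiment = run_experiment(dataset, task, evaluators=[exact_match])
```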

LLM Evaluators

LLM evaluators use a judge model to assess the quality of outputs. These are useful for subjective or nuanced evaluations where simple rules don’t suffice.
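Under the hood these follow the LLM-as-a-judge pattern: the input, context, and output are rendered into a prompt template, and the judge model returns one of a fixed set of labels. A minimal sketch using the `llm_classify` helper from `phoenix.evals` with its built-in hallucination template (the example data and model choice are arbitrary, and the class names of the individual evaluators listed below depend on your package version):

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per example; column names must match the template variables.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```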

Faithfulness

Measures whether a response is faithful to (grounded in) the provided context. Detects hallucinations and unsupported claims.

Conciseness

Evaluates whether a response is concise and free of unnecessary content like filler, hedging, and meta-commentary.

Correctness

Evaluates the general correctness of an LLM response.

Document Relevance

Assesses whether retrieved documents are relevant to the input query. Useful for RAG evaluation.

Tool Selection

Determines whether the correct tool was selected for a given context from the available options.

Tool Invocation

Checks if a tool was invoked correctly with proper arguments, formatting, and safe content.

Tool Response Handling

Evaluates whether an agent correctly processed a tool’s result, including error handling, data extraction, and safe information disclosure.

Refusal

Detects when an LLM refuses, declines, or avoids answering a user query.

Code Evaluators

Code evaluators use deterministic logic for evaluation. These are faster, cheaper, and provide consistent results for objective criteria.

Exact Match

Checks if the output exactly matches an expected value. Supports optional normalization.
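The logic is roughly as follows (a plain-Python sketch, with whitespace trimming and lowercasing standing in for the optional normalization step):

```python
def exact_match(output: str, expected: str, normalize: bool = True) -> float:
    # Optionally normalize before comparing: trim whitespace and lowercase.
    if normalize:
        output, expected = output.strip().lower(), expected.strip().lower()
    return float(output == expected)

assert exact_match("  Paris ", "paris") == 1.0
assert exact_match("Paris", "paris", normalize=False) == 0.0
```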

Matches Regex

Validates that output matches a specified regular expression pattern.
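Conceptually (a plain-Python sketch; the pattern shown is only an example):

```python
import re

def matches_regex(output: str, pattern: str) -> float:
    # Score 1.0 if the pattern is found anywhere in the output.
    return float(re.search(pattern, output) is not None)

# e.g., check that the output contains an ISO-8601 date
assert matches_regex("Shipped on 2024-05-01.", r"\d{4}-\d{2}-\d{2}") == 1.0
```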

Precision / Recall / F-Score

Computes precision, recall, and F1 scores for comparing predicted vs actual values.
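For binary labels the definitions reduce to the following (a plain-Python sketch; libraries such as scikit-learn compute the same quantities):

```python
def precision_recall_f1(predicted: list[bool], actual: list[bool]) -> tuple[float, float, float]:
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([True, True, False, True], [True, False, False, True]))
# (0.666..., 1.0, 0.8)
```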

Legacy Evaluators

Legacy evaluators are template-based evaluators from earlier versions of Phoenix. They remain available for backward compatibility, but we recommend the modern evaluators above for new projects.
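These ship as evaluator classes in `phoenix.evals` and can be run in bulk with `run_evals`. A minimal sketch using the hallucination and Q&A evaluators (the example data and model choice are arbitrary; column names follow the legacy templates):

```python
import pandas as pd
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

model = OpenAIModel(model="gpt-4o-mini")
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    provide_explanation=True,
)
```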

Q&A Evaluation

Evaluates Q&A correctness using legacy templates.

Retrieval / RAG Relevance

Legacy document relevance evaluation for RAG systems.

Summarization

Evaluates summary quality using legacy templates.

Toxicity

Legacy toxicity detection evaluation.

SQL Generation

Evaluates SQL query correctness using legacy templates.

Tool Calling (Legacy)

Legacy tool calling evaluation. Consider using Tool Invocation and Tool Selection instead.

Looking to create custom evaluators? See the Building Custom Evaluators guide.