Phoenix provides pre-built evaluation metrics that can be used out of the box to assess LLM application quality. These metrics are available in both Python and TypeScript and are designed to work seamlessly with Phoenix’s tracing and experiment infrastructure. All LLM evaluation templates are tested against golden datasets and achieve an F1 score of 85% or higher on benchmarks.
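As a rough sketch of how the pieces fit together with tracing: spans captured by Phoenix can be exported, scored with one of the pre-built evaluators (see the sections below), and logged back so the results appear alongside the traces. The snippet assumes a running Phoenix instance with the `arize-phoenix` and `arize-phoenix-evals` packages installed; exact import paths may differ between versions.

```python
# Minimal sketch (assumes a running Phoenix server and arize-phoenix installed).
import phoenix as px
from phoenix.trace import SpanEvaluations

client = px.Client()

# Pull the spans captured by Phoenix tracing into a dataframe.
spans_df = client.get_spans_dataframe()

# ... run one of the pre-built evaluators over spans_df to produce eval_df ...

# Attach the scores to the original spans (eval_df must be indexed by span id):
# client.log_evaluations(SpanEvaluations(eval_name="Hallucination", dataframe=eval_df))
```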

LLM Evaluators

LLM evaluators use a judge model to assess the quality of outputs. These are useful for subjective or nuanced evaluations where simple rules don’t suffice.
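For example, here is a hedged sketch of running judge-model evaluators from `phoenix.evals` over a dataframe of examples. `HallucinationEvaluator`, `QAEvaluator`, `OpenAIModel`, and `run_evals` come from the Phoenix evals package, but names and signatures can shift between versions, and the judge model chosen here is only an assumption.

```python
import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

# Judge model used by the LLM evaluators (any supported model wrapper works).
judge = OpenAIModel(model="gpt-4o")

# The pre-built templates expect columns named input / reference / output.
df = pd.DataFrame(
    {
        "input": ["Where is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is in Paris, France."],
        "output": ["The Eiffel Tower is located in Paris."],
    }
)

# Each evaluator returns a dataframe of labels, scores, and (optionally) explanations.
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(judge), QAEvaluator(judge)],
    provide_explanation=True,
)
print(hallucination_df[["label", "score"]])
```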

Code Evaluators

Code evaluators apply deterministic logic instead of a judge model. They are faster, cheaper, and produce consistent results for objective criteria.
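To illustrate the pattern, a deterministic evaluator is just a function that checks an objective property of the output. The sketch below uses plain Python rather than the exact Phoenix classes; the function names are purely illustrative.

```python
import json
import re

def matches_regex(output: str, pattern: str) -> float:
    """Return 1.0 if the output matches the pattern, else 0.0 (illustrative only)."""
    return 1.0 if re.search(pattern, output) else 0.0

def json_parsable(output: str) -> float:
    """Return 1.0 if the output is valid JSON, else 0.0 (illustrative only)."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

# Deterministic checks are cheap enough to run on every example.
print(matches_regex("Order #12345 confirmed", r"#\d+"))  # 1.0
print(json_parsable('{"status": "ok"}'))                 # 1.0
```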

Legacy Evaluators

Legacy evaluators are template-based evaluators from earlier versions of Phoenix. They remain available for backwards compatibility, but we recommend the modern evaluators above for new projects.
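For reference, the legacy workflow pairs a prompt template with the `llm_classify` helper. The sketch below uses the built-in hallucination template from `phoenix.evals`; parameter names (such as `dataframe`) have changed between versions, so treat it as an approximation rather than the canonical call.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame(
    {
        "input": ["What year did the moon landing happen?"],
        "reference": ["Apollo 11 landed on the moon in 1969."],
        "output": ["The first moon landing was in 1969."],
    }
)

# Rails constrain the judge's answer to the labels the template expects.
results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```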
Looking to create custom evaluators? See the Building Custom Evaluators guide.