
Overview

The PrecisionRecallFScore evaluator computes precision, recall, and F-beta scores for comparing predicted labels against expected labels. It supports both binary and multi-class classification with various averaging strategies.

When to Use

Use the PrecisionRecallFScore evaluator when you need to:
  • Evaluate classification performance - Measure how well your model predicts the correct labels
  • Compare label sequences - Assess predicted vs. expected labels for multi-item outputs
  • Compute binary classification metrics - Score spam/ham, positive/negative, and similar two-class tasks
  • Evaluate multi-class outputs - Measure performance across multiple categories with different averaging strategies
This is a code-based evaluator that computes standard classification metrics. Both expected and output should be sequences of labels (strings or integers).

Supported Levels

This evaluator is not tied to specific tracing levels. It operates on lists of predicted and expected labels, making it useful for:
  • Comparing model predictions against ground truth labels
  • Evaluating classification outputs at any level where you have paired label sequences
  • Batch evaluation of classification tasks in experiments

Input Requirements

The PrecisionRecallFScore evaluator requires two inputs:
Field      Type             Description
expected   List[str | int]  List of expected/true labels
output     List[str | int]  List of predicted labels
Both sequences must have the same length and contain at least one element.
For numeric labels {0, 1}, the evaluator automatically treats 1 as the positive class.
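
A minimal input dictionary with numeric binary labels might look like this (the values are illustrative):

# Numeric {0, 1} labels: 1 is treated as the positive class automatically.
eval_input = {
    "expected": [1, 0, 1, 1, 0],
    "output":   [1, 1, 0, 1, 0],
}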

Constructor Arguments

Argument        Type       Default   Description
beta            float      1.0       Weight of recall relative to precision (F1 by default)
average         str        "macro"   Averaging strategy: "macro", "micro", or "weighted"
positive_label  str | int  None      For binary classification, specify the positive class
zero_division   float      0.0       Value to use when a metric is undefined (0/0)
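
For example, a non-default configuration might weight recall more heavily and use weighted averaging (the argument values below are illustrative):

from phoenix.evals.metrics import PrecisionRecallFScore

# F2 score (recall weighted twice as heavily as precision), weighted averaging,
# and 0.0 returned whenever a metric's denominator is zero.
evaluator = PrecisionRecallFScore(
    beta=2.0,
    average="weighted",
    zero_division=0.0,
)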

Output Interpretation

The evaluator returns three Score objects:
Score Name       Description
precision        Ratio of true positives to predicted positives
recall           Ratio of true positives to actual positives
f1 (or f{beta})  Harmonic mean of precision and recall
All scores have:
  • direction = "maximize" (higher is better)
  • kind = "code" (code-based evaluator)
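
For reference, the f{beta} score follows the standard F-beta definition (not Phoenix-specific), which reduces to the plain harmonic mean when beta = 1:

# Standard F-beta definition: beta > 1 favors recall, beta < 1 favors precision.
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0  # mirrors a zero_division fallback of 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f"{f_beta(0.667, 1.0):.3f}")          # 0.800 (F1)
print(f"{f_beta(0.667, 1.0, beta=2):.3f}")  # 0.909 (F2 leans toward recall)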

Averaging Strategies

For multi-class classification, the average argument controls how per-class results are combined:
Strategy   Description
macro      Calculate metrics for each class, then average (treats all classes equally)
micro      Calculate metrics globally by counting total TP, FP, and FN
weighted   Average per-class metrics weighted by class support (number of true instances)
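
To see how the strategies differ in practice, here is a hand-computed sketch (plain Python, no Phoenix API) contrasting macro and micro precision on a small multi-class example:

from collections import Counter

expected = ["cat", "dog", "cat", "bird", "dog"]
output = ["cat", "cat", "cat", "bird", "dog"]

labels = sorted(set(expected) | set(output))
tp = Counter()  # true positives per predicted class
fp = Counter()  # false positives per predicted class

for e, o in zip(expected, output):
    if e == o:
        tp[o] += 1
    else:
        fp[o] += 1

# Macro: average the per-class precisions, so every class counts equally.
per_class = [tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in labels]
macro_precision = sum(per_class) / len(labels)

# Micro: pool the counts across classes before dividing.
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(f"macro precision: {macro_precision:.3f}")  # 0.889
print(f"micro precision: {micro_precision:.3f}")  # 0.800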

Usage Examples

from phoenix.evals.metrics import PrecisionRecallFScore

# Create evaluator with default settings (F1, macro averaging)
evaluator = PrecisionRecallFScore()

# Inspect the evaluator's requirements
print(evaluator.describe())

# Multi-class evaluation
eval_input = {
    "expected": ["cat", "dog", "cat", "bird", "dog"],
    "output": ["cat", "cat", "cat", "bird", "dog"]
}

scores = evaluator.evaluate(eval_input)
for score in scores:
    print(f"{score.name}: {score.score:.3f}")
# precision: 0.889
# recall: 0.833
# f1: 0.833

Binary Classification

For binary classification, specify the positive label:
from phoenix.evals.metrics import PrecisionRecallFScore

# Binary classification for spam detection
evaluator = PrecisionRecallFScore(positive_label="spam")

eval_input = {
    "expected": ["spam", "ham", "spam", "ham", "spam"],
    "output": ["spam", "spam", "ham", "ham", "spam"]
}

scores = evaluator.evaluate(eval_input)
for score in scores:
    print(f"{score.name}: {score.score:.3f}")
# precision: 0.667  (2 TP / 3 predicted spam)
# recall: 0.667     (2 TP / 3 actual spam)
# f1: 0.667

Using with Phoenix

Evaluating Traces

Run evaluations on traces collected in Phoenix and log results as annotations:
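The original page embeds runnable snippets here. As a rough, hedged sketch (the client methods and the expected_labels / predicted_labels columns below are assumptions that may differ across Phoenix versions), you could pull spans into a dataframe, score each one, and log the results back:

# Hedged sketch: get_spans_dataframe / log_evaluations and the
# "expected_labels" / "predicted_labels" columns are assumptions to adapt.
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations
from phoenix.evals.metrics import PrecisionRecallFScore

client = px.Client()
spans_df = client.get_spans_dataframe()
evaluator = PrecisionRecallFScore()

records = []
for span_id, row in spans_df.iterrows():
    scores = evaluator.evaluate(
        {"expected": row["expected_labels"], "output": row["predicted_labels"]}
    )
    f_score = next(s for s in scores if s.name.startswith("f"))
    records.append({"context.span_id": span_id, "score": f_score.score})

eval_df = pd.DataFrame(records).set_index("context.span_id")
client.log_evaluations(SpanEvaluations(eval_name="F1", dataframe=eval_df))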

Running Experiments

Use the PrecisionRecallFScore evaluator in Phoenix experiments:
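A hedged sketch of one way to wire the evaluator into an experiment (the dataset name, example fields, and the f1 wrapper function are illustrative; check the Phoenix experiments documentation for the exact evaluator interface):

# Hedged sketch: dataset name, example fields, and the classify() task are
# placeholders; the evaluator is a plain function wrapping the metric rather
# than the PrecisionRecallFScore instance itself.
import phoenix as px
from phoenix.experiments import run_experiment
from phoenix.evals.metrics import PrecisionRecallFScore

metric = PrecisionRecallFScore(average="macro")

def classify(input):
    # Replace with your model call; must return a list of predicted labels.
    return ["cat", "dog", "cat"]

def f1(output, expected):
    scores = metric.evaluate({"expected": expected["labels"], "output": output})
    return next(s.score for s in scores if s.name.startswith("f"))

dataset = px.Client().get_dataset(name="classification-examples")
run_experiment(dataset, task=classify, evaluators=[f1])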

API Reference