Metrics

Chapter Summary

This chapter details core LLM evaluation metric categories, trade-offs, and real-world application patterns — expanding on the foundational concepts of LLM evaluation covered earlier in this guide.

Get started with LLM evaluation by following the quickstart guide in the docs.

Large language models (LLMs) are probabilistic systems. Evaluating them requires more than a pass/fail mindset. Metrics are the bridge between human expectations and machine behavior — offering objective ways to monitor performance, debug regressions, and guide iterative improvement. Whether you’re assessing offline golden dataset performance or live application outputs, clear and appropriate metrics are the key to continuous LLM system improvement.

Categories of LLM Evaluation Metrics

LLM evaluation metrics are often grouped based on what aspect of the LLM system is being measured: correctness; relevance; hallucinations and faithfulness; toxicity and safety; or fluency, coherence, and helpfulness.

Correctness

Definition: Correctness determines if the LLM’s output accurately answers the given question or completes the specified task.

Evaluation Methods

Correctness evaluation methods include:

  • Binary Classification: Labels outputs as either Correct or Incorrect.
  • Multi-Class Classification: Categorizes outputs into Fully Correct, Partially Correct, or Incorrect.
  • Statistical Measures: Utilizes metrics like Precision, Recall, and F1 Score when comparing against a golden dataset (see the sketch below).
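
As a minimal sketch of the statistical measures above (assuming binary correct/incorrect labels compared against a golden dataset of ground-truth labels), precision, recall, and F1 can be computed with scikit-learn:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = correct, 0 = incorrect
golden_labels = [1, 1, 0, 1, 0, 1]  # ground truth from the golden dataset
eval_labels = [1, 0, 0, 1, 1, 1]    # labels produced by the evaluator

precision = precision_score(golden_labels, eval_labels)
recall = recall_score(golden_labels, eval_labels)
f1 = f1_score(golden_labels, eval_labels)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")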

Use Cases

Correctness can be useful with:

  • Question answering against a golden dataset
  • Verifying that instructions or tasks were completed as specified

Correctness: Implementation Example

This example leverages Phoenix, Arize’s open source evaluation and tracing platform.

from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# LLM used as the judge
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Constrain the judge's output to the labels expected by the QA template
rails = list(QA_PROMPT_RAILS_MAP.values())

# df_sample is a DataFrame of questions, reference texts, and answers to evaluate
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Relevance

Definition: Relevance assesses how well the LLM’s output or retrieved documents align with the user’s query or intent.

Evaluation Methods

Relevance evaluation methods include:

  • Binary Classification: Labels as Relevant or Irrelevant.
  • Ranking Metrics: Measures like Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (nDCG), as sketched below.
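
As a small sketch of the ranking metrics above, the reciprocal rank and a simple nDCG can be computed directly from per-document relevance labels listed in retrieval order (MRR is the mean of reciprocal ranks across queries):

import math

def reciprocal_rank(relevance):
    # relevance is a list of 0/1 labels in retrieval order
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg(relevance):
    # normalized discounted cumulative gain for binary relevance labels
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the second and third retrieved documents were judged relevant
print(reciprocal_rank([0, 1, 1]))  # 0.5
print(ndcg([0, 1, 1]))             # ~0.69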

Use Cases

Relevance can be useful with:

  • Evaluating RAG retrieval quality
  • Search and summarization applications

Relevance: Implementation Example

This example leverages Phoenix:

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# df is a DataFrame of queries and the documents retrieved for each
relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Hallucination / Faithfulness

Definition: Hallucination and faithfulness evaluations determine whether the LLM’s output introduces information not grounded in the provided context.

Evaluation Methods

Hallucination and faithfulness evaluation methods include:

  • Binary Classification: Labels as Factual or Hallucinated.
  • Explanatory Feedback: Provides reasoning behind the classification.

Use Cases

Hallucination and faithfulness can be useful with:

  • Evaluating RAG retrieval quality
  • Search and summarization applications

💡 Check out our guide to LLM hallucination examples.

Implementation Example

This example leverages Phoenix:

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# df is a DataFrame of inputs, reference contexts, and model outputs
hallucination_classifications = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Toxicity and Safety

Definition: Toxicity and safety evaluations assess whether the LLM’s output contains harmful, biased, or inappropriate content.

Evaluation Methods

Toxicity and safety evaluation methods include:

  • Binary Classification: Labels as Safe or Unsafe.
  • Scoring Systems: Assigns severity levels to unsafe content.

Use Cases

Toxicity and safety evaluations can be useful with:

  • Monitoring consumer-facing chatbots
  • Ensuring compliance in enterprise applications

Implementation Example

This example leverages Phoenix:

from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

rails = list(TOXICITY_PROMPT_RAILS_MAP.values())

# df_sample is a DataFrame of model outputs to screen for toxic content
toxic_classifications = llm_classify(
    dataframe=df_sample,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Fluency, Coherence, Helpfulness

Definition: Fluency, coherence, and helpfulness evals measure the linguistic quality and usefulness of the LLM’s output.

Evaluation Methods

Fluency, coherence, and helpfulness evaluation methods include:

  • Likert Scale Ratings: Scores ranging from 1 to 5.
  • Pairwise Comparisons: Evaluates which of two outputs is more helpful or coherent, as illustrated below.
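
One way to run a pairwise comparison is to reuse llm_classify with a custom template. The template text, DataFrame name, and column names below are illustrative assumptions, not built-in Phoenix artifacts:

from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical pairwise template; placeholders must match columns in the DataFrame
PAIRWISE_TEMPLATE = """You are comparing two responses to the same query.
[Query]: {query}
[Response A]: {response_a}
[Response B]: {response_b}
Respond with a single word, "A" or "B", naming the more helpful and coherent response."""

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

pairwise_results = llm_classify(
    dataframe=df_pairs,  # assumed DataFrame with query, response_a, and response_b columns
    template=PAIRWISE_TEMPLATE,
    model=model,
    rails=["A", "B"],
    provide_explanation=True,
)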

Use Cases

Fluency, coherence, and helpfulness can be useful with:

  • Enhancing user experience in conversational agents
  • Refining content generation models

Implementation Example

This example leverages Phoenix:

from phoenix.experiments.evaluators import HelpfulnessEvaluator
from phoenix.evals.models import OpenAIModel

# An LLM-based evaluator that can be passed to a Phoenix experiment to rate helpfulness
helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())

Choosing Output Types

Here is a guide on when and where to use specific LLM evaluation metric output types:

  • Binary: Two categories (e.g., correct / incorrect). Best for clear-cut, quick evaluations.
  • Multi-class: Multiple categories (e.g., fully correct, partially correct, incorrect). Best for nuanced assessments.
  • Categorical score: A numerical representation of categories. Best for aggregation and statistical analysis.
  • Continuous score: A range of values (e.g., 1-10). Best for detailed feedback (use cautiously).

TL;DR: Prefer categorical evaluations for stability and interpretability. Use continuous scores when detailed gradation is necessary, ensuring proper normalization.
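
As a small illustration of the guidance above, categorical labels can be mapped to categorical scores for aggregation, and continuous scores can be normalized to a common range before averaging (the label-to-score mapping here is an assumption for the sketch):

import pandas as pd

# Hypothetical labels produced by an LLM judge
labels = pd.Series(["fully correct", "partially correct", "incorrect", "fully correct"])

# Categorical score: map each label to a number so results can be aggregated
label_to_score = {"incorrect": 0.0, "partially correct": 0.5, "fully correct": 1.0}
print(labels.map(label_to_score).mean())  # 0.625

# Continuous score: normalize a 1-10 rating to the 0-1 range before aggregating
raw_ratings = pd.Series([7, 9, 4, 10])
print(((raw_ratings - 1) / 9).mean())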

Implementing Evaluations: LLM-as-a-Judge vs. Code-Based

LLM-as-a-Judge:

  • Utilizes an LLM to assess outputs.
  • Suitable for subjective or complex evaluations.
  • Requires carefully crafted prompts and may need calibration.

For example, a correctness judge prompt might look like this:

You are given a question, an answer, and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Reference]: {context}
    ************
    [Answer]: {sampled_answer}
    [END DATA]
Your response must be a single word, either “correct” or “incorrect”, and should not contain any text or characters aside from that word. “correct” means that the question is correctly and fully answered by the answer. “incorrect” means that the question is not correctly or only partially answered by the answer.
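
A prompt like this can be run over a DataFrame with llm_classify, using rails to constrain the judge to the two allowed labels. The variable name and DataFrame below are assumptions; the columns must match the template's placeholders (question, context, sampled_answer):

from phoenix.evals import OpenAIModel, llm_classify

judge_results = llm_classify(
    dataframe=df,                         # assumed columns: question, context, sampled_answer
    template=CORRECTNESS_JUDGE_TEMPLATE,  # the prompt shown above, stored as a string
    model=OpenAIModel(model_name="gpt-4", temperature=0.0),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)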

Code-Based Evaluation:

  • Employs deterministic rules or logic.
  • Is ideal for structured output validation.

# These examples assume Evaluator and EvaluationResult are imported from
# Phoenix's experiments module (exact import paths vary by Phoenix version).

class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        # Return a full EvaluationResult with a score, label, and explanation
        return EvaluationResult(
            score=1.0,
            label="good",
            explanation="Output met the expected criteria",
        )

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> float:
        # Returning a bare float is treated as a score
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> str:
        # Returning a string is treated as a label
        return "good"
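
For structured output validation specifically, a deterministic check might verify that the model returned a JSON object with required keys. This is a hedged sketch that reuses the same assumed Evaluator interface; the required keys are hypothetical:

import json

class ValidJSONEvaluator(Evaluator):
    # Code-based check: does the output parse as a JSON object with the required keys?
    REQUIRED_KEYS = {"answer", "citations"}  # hypothetical schema for illustration

    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        try:
            parsed = json.loads(output)
        except (TypeError, ValueError):
            parsed = None
        ok = isinstance(parsed, dict) and self.REQUIRED_KEYS.issubset(parsed)
        return EvaluationResult(
            score=1.0 if ok else 0.0,
            label="valid" if ok else "invalid",
            explanation="well-formed JSON with required keys" if ok else "malformed or incomplete JSON",
        )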

Composite Metrics and Dashboards

Combining multiple LLM evaluation metrics as composite metrics provides a holistic view of LLM performance. Dashboards can track overall evaluation pass rates, correctness across different contexts, hallucination rates by model version, toxicity occurrences over time, and more.

It can also help to use platforms like Arize Phoenix to create interactive dashboards to visualize these metrics, monitor trends, and identify areas for improvement.

Common Pitfalls

Common problems when leveraging LLM evaluation metrics include:

  • Overfitting to Test Data: Ensure evaluations generalize beyond the test set.
  • Subjective Labeling: Establish clear guidelines to maintain consistency.
  • Inconsistent LLM Judgments: Incorporate few-shot examples or ensemble methods.
  • Ambiguous Definitions: Clearly define evaluation criteria and prompts.

Case Study: Evaluating a RAG System

Step 1: Document Relevance
Assess if retrieved documents are pertinent to the query.

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
)

# Initialize the evaluation model
model = OpenAIModel(model_name="gpt-4", api_key="your-api-key")

# Define the expected outputs
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# Run the relevance evaluation
relevance_results = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Step 2: Response Faithfulness
Determine if the response is grounded in the provided context.

from phoenix.evals import (
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# Define the expected outputs
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# Run the faithfulness evaluation
faithfulness_results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)


Step 3: Answer Correctness
Evaluate if the response correctly answers the query.

from phoenix.evals import (
    QA_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
)

# Define the expected outputs
rails = list(QA_PROMPT_RAILS_MAP.values())

# Run the correctness evaluation
correctness_results = llm_classify(
    dataframe=df,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Step 4: Combine Results
Merge evaluation results for comprehensive analysis.

# Add evaluation results to the original DataFrame
df['relevance_label'] = relevance_results['label']
df['faithfulness_label'] = faithfulness_results['label']
df['correctness_label'] = correctness_results['label']
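
With the labels merged, composite metrics such as per-eval and overall pass rates can be rolled up for a dashboard. The passing label strings below are assumptions; check the corresponding rails maps for the exact values your templates produce:

# Hypothetical "passing" label for each eval
passing = {
    "relevance_label": "relevant",
    "faithfulness_label": "factual",
    "correctness_label": "correct",
}

# Per-metric pass rates
for column, label in passing.items():
    print(f"{column}: {(df[column] == label).mean():.1%} pass rate")

# Composite: fraction of rows that pass all three evals
all_pass = (
    (df["relevance_label"] == passing["relevance_label"])
    & (df["faithfulness_label"] == passing["faithfulness_label"])
    & (df["correctness_label"] == passing["correctness_label"])
)
print(f"composite pass rate: {all_pass.mean():.1%}")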

Custom Evaluation Metrics and Benchmarks

While standard benchmarks like HELM or MMLU provide general insights, custom metrics tailored to specific applications offer more relevant evaluations.

Examples of custom metrics:

  • Customer Support: Empathy detection, resolution accuracy
  • Legal Applications: Citation correctness, jurisdiction relevance
  • E-commerce: Product recommendation accuracy, sentiment alignment

To create custom evaluations:

  1. Define the Metric: Clearly specify what you aim to measure.
  2. Develop a Prompt Template: Craft prompts that elicit the desired evaluation.
  3. Run Evaluations: Utilize Phoenix’s tools to assess and analyze results (see the example below).
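
For example, a custom empathy-detection eval for customer support could follow the same llm_classify pattern. The template text, DataFrame name, column names, and rails below are all assumptions for the sketch:

from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical custom template; {input} and {output} must be columns in the DataFrame
EMPATHY_TEMPLATE = """You are reviewing a customer support reply.
[Customer Message]: {input}
[Agent Reply]: {output}
Respond with a single word, "empathetic" or "unempathetic", describing the reply's tone."""

empathy_results = llm_classify(
    dataframe=support_df,  # assumed DataFrame of customer messages and agent replies
    template=EMPATHY_TEMPLATE,
    model=OpenAIModel(model_name="gpt-4", temperature=0.0),
    rails=["empathetic", "unempathetic"],
    provide_explanation=True,
)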

Continuous Evaluation and Iteration

LLM applications evolve over time, necessitating ongoing evaluation.

Best practices:

  • Regularly update evaluation datasets.
  • Iterate on prompt templates based on feedback.
  • Compare LLM evaluations with human judgments.
  • Integrate evaluations into CI/CD pipelines for automated monitoring (an example CI gate is sketched below).
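
For the CI/CD point, one common pattern is a test that fails the pipeline when an eval pass rate drops below a threshold. This is a minimal sketch assuming a hypothetical run_correctness_eval() helper that returns a labeled DataFrame:

# test_llm_evals.py
PASS_RATE_THRESHOLD = 0.90  # tune to your application's requirements

def test_correctness_pass_rate():
    results = run_correctness_eval()  # hypothetical helper returning a labeled DataFrame
    pass_rate = (results["label"] == "correct").mean()
    assert pass_rate >= PASS_RATE_THRESHOLD, (
        f"Correctness pass rate {pass_rate:.1%} fell below {PASS_RATE_THRESHOLD:.0%}"
    )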

Ultimately, effective evaluation of LLMs is crucial for deploying reliable and trustworthy AI applications. By understanding and implementing the appropriate metrics and continuously refining evaluation strategies, organizations can ensure their agents and LLM applications meet the desired performance standards.