
Overview

The Document Relevance evaluator determines whether a retrieved document contains information relevant to answering a specific question. This is essential for evaluating RAG (Retrieval-Augmented Generation) systems where document quality directly impacts response quality.

When to Use

Use the Document Relevance evaluator when you need to:
  • Evaluate RAG retrieval quality - Assess whether your retrieval system is returning useful documents
  • Debug poor RAG responses - Identify if issues stem from retrieval vs generation
  • Compare retrieval strategies - Test different embedding models, chunking strategies, or search algorithms
  • Monitor retrieval in production - Track document relevance over time
This evaluator assesses individual document relevance to a query. For evaluating whether a response is faithful to its context, use the Faithfulness evaluator instead.

Supported Levels

The level of an evaluator determines the scope of the evaluation in OpenTelemetry terms. Some evaluations apply to individual spans, some to full traces or sessions, and some can be run at multiple levels.
Level | Supported | Notes
Span | Yes | Best for retriever spans. Evaluate each retrieved document individually.
Relevant span kinds: Retriever spans, embedding spans, or any span that retrieves documents from a knowledge base.

Input Requirements

The Document Relevance evaluator requires two inputs:
Field | Type | Description
input | string | The user’s query or question
document_text | string | The document text to evaluate for relevance
In TypeScript, the field is named documentText (camelCase) instead of document_text (snake_case).

Formatting Tips

For best results:
  • Evaluate one document at a time - Run the evaluator separately for each retrieved document
  • Use the full document chunk - Include the complete text that was retrieved, not just snippets
  • Include metadata if helpful - Document titles or sources can provide useful context (see the sketch below)
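
One simple way to apply the last tip is to fold the title or source into document_text before evaluating. This is only a sketch; the metadata keys below (title, source, text) are hypothetical and depend on what your retriever returns.

doc = {
    "title": "France",
    "source": "https://en.wikipedia.org/wiki/France",
    "text": "Paris is the capital and largest city of France.",
}

# Prepend the title and source so the evaluator sees them as part of the document
document_text = f"Title: {doc['title']}\nSource: {doc['source']}\n\n{doc['text']}"

eval_input = {
    "input": "What is the capital of France?",
    "document_text": document_text,
}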

Output Interpretation

The evaluator returns a Score object with the following properties:
Property | Value | Description
label | "relevant" or "unrelated" | Classification result
score | 1.0 or 0.0 | Numeric score (1.0 = relevant, 0.0 = unrelated)
explanation | string | LLM-generated reasoning for the classification
direction | "maximize" | Higher scores are better
metadata | object | Additional information such as the model name. When tracing is enabled, includes the trace_id for the evaluation.
Interpretation:
  • Relevant (1.0): The document contains information that can help answer the question
  • Unrelated (0.0): The document does not contain relevant information for the question
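
For example, given the list returned by evaluate() (see Usage Examples below), the properties can be read off the first Score directly:

score = scores[0]          # scores as returned by relevance_eval.evaluate(...)
print(score.label)         # "relevant" or "unrelated"
print(score.score)         # 1.0 or 0.0
print(score.explanation)   # the LLM's reasoning for the classification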

Usage Examples

from phoenix.evals import LLM
from phoenix.evals.metrics import DocumentRelevanceEvaluator

# Initialize the LLM client
llm = LLM(provider="openai", model="gpt-4o")

# Create the evaluator
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# Inspect the evaluator's requirements
print(relevance_eval.describe())

# Evaluate a single document
eval_input = {
    "input": "What is the capital of France?",
    "document_text": "Paris is the capital and largest city of France."
}

scores = relevance_eval.evaluate(eval_input)
print(scores[0])
# Score(name='document_relevance', score=1.0, label='relevant', ...)

Evaluating Multiple Documents

To evaluate all documents returned by a retriever, iterate over each document:
from phoenix.evals import LLM
from phoenix.evals.metrics import DocumentRelevanceEvaluator

llm = LLM(provider="openai", model="gpt-4o")
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

query = "What are the symptoms of COVID-19?"
retrieved_documents = [
    "COVID-19 symptoms include fever, cough, and fatigue.",
    "The history of coronaviruses dates back to the 1960s.",
    "Treatment options for COVID-19 include antiviral medications."
]

# Evaluate each document
for i, doc in enumerate(retrieved_documents):
    scores = relevance_eval.evaluate({
        "input": query,
        "document_text": doc
    })
    print(f"Document {i+1}: {scores[0].label} ({scores[0].score})")

# Document 1: relevant (1.0)
# Document 2: unrelated (0.0)
# Document 3: unrelated (0.0)

Using Input Mapping

When your data has different field names, use input mapping:
from phoenix.evals import LLM
from phoenix.evals.metrics import DocumentRelevanceEvaluator

llm = LLM(provider="openai", model="gpt-4o")
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# Example with different field names
eval_input = {
    "query": "What is machine learning?",
    "chunk": "Machine learning is a subset of AI that enables systems to learn from data."
}

# Use input mapping to match expected field names
input_mapping = {
    "input": "query",
    "document_text": "chunk"
}

scores = relevance_eval.evaluate(eval_input, input_mapping)
For more details on input mapping options, see Input Mapping.

Configuration

For LLM client configuration options, see Configuring the LLM.

Viewing and Modifying the Prompt

You can view the latest versions of our prompt templates on GitHub. The built-in evaluators are designed to work well in a variety of contexts, but we highly recommend adapting the prompt to be more specific to your use case.
from phoenix.evals.metrics import DocumentRelevanceEvaluator
from phoenix.evals import LLM, ClassificationEvaluator

llm = LLM(provider="openai", model="gpt-4o")
evaluator = DocumentRelevanceEvaluator(llm=llm)

# View the prompt template
print(evaluator.prompt_template)

# Create a custom evaluator based on the built-in template
custom_evaluator = ClassificationEvaluator(
    name="document_relevance",
    prompt_template=evaluator.prompt_template,  # Modify as needed
    llm=llm,
    choices={"relevant": 1.0, "unrelated": 0.0},
    direction="maximize",
)

Using with Phoenix

Evaluating Traces

Run evaluations on traces collected in Phoenix and log results as annotations:
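The exact workflow depends on how your application is instrumented, so treat the following as a sketch rather than a recipe: the span filter and the column names used to pull the query and document text are assumptions you will need to adapt to your own schema.

import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations
from phoenix.evals import LLM
from phoenix.evals.metrics import DocumentRelevanceEvaluator

llm = LLM(provider="openai", model="gpt-4o")
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# Pull retriever spans from Phoenix into a dataframe
spans_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")

# Score each span. The column names below ("attributes.input.value" for the
# query, "document_text" for the retrieved text) are assumptions; adapt them
# to however your instrumentation records the query and documents.
rows = []
for span_id, span in spans_df.iterrows():
    scores = relevance_eval.evaluate({
        "input": span["attributes.input.value"],
        "document_text": span["document_text"],
    })
    rows.append({
        "context.span_id": span_id,
        "label": scores[0].label,
        "score": scores[0].score,
        "explanation": scores[0].explanation,
    })

# Log the results back to Phoenix as annotations on the evaluated spans
evals_df = pd.DataFrame(rows).set_index("context.span_id")
px.Client().log_evaluations(
    SpanEvaluations(eval_name="document_relevance", dataframe=evals_df)
)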

Running Experiments

Use the Document Relevance evaluator in Phoenix experiments:
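Here is a sketch of one way to wire the evaluator into an experiment. The dataset, the retrieval stub, and the example field name ("question") are assumptions; the experiment evaluator is a plain function that returns the numeric score.

from phoenix.experiments import run_experiment
from phoenix.evals import LLM
from phoenix.evals.metrics import DocumentRelevanceEvaluator

llm = LLM(provider="openai", model="gpt-4o")
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

def task(input):
    # Hypothetical retrieval step: replace with your own pipeline and
    # return the text of the retrieved document.
    return retrieve_top_document(input["question"])

def document_relevance(input, output):
    # Score the retrieved document (the task output) against the query
    scores = relevance_eval.evaluate({
        "input": input["question"],
        "document_text": output,
    })
    return scores[0].score

# dataset: a Phoenix dataset whose examples have a "question" input field
experiment = run_experiment(
    dataset,
    task=task,
    evaluators=[document_relevance],
    experiment_name="document-relevance",
)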

API Reference