1 of 1

Relevance Classification Evaluation

Evaluate the relevance of documents retrieved by RAG applications using Phoenix's evaluation framework.

This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.

Key Points:

Download and prepare benchmark datasets for relevance classification
Compare different LLM models (GPT-4, GPT-3.5, GPT-4 Turbo) for classification accuracy
Analyze results with confusion matrices and detailed reports
Get explanations for LLM classifications to understand decision-making
Measure retrieval quality using ranking metrics like precision@k

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the full notebook.

Download Benchmark Dataset

df = download_benchmark_dataset(
    task="binary-relevance-classification", 
    dataset_name="wiki_qa-train"
)

Configure Evaluation

N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})

Run Relevance Classification

model = OpenAIModel(model="gpt-4", temperature=0.0)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

Evaluate Results

true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

Get Explanations

relevance_classifications_df = llm_classify(
    dataframe=df_sample.sample(n=5),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    concurrency=20,
)

Compare Models

Run the same evaluation with different models:

# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)

# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)