Relevance Classification Evaluation
Evaluate the relevance of documents retrieved by RAG applications using Phoenix's evaluation framework.
This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.
Key Points:
Download and prepare benchmark datasets for relevance classification
Compare LLMs (GPT-4, GPT-3.5, GPT-4 Turbo) on classification accuracy
Analyze results with confusion matrices and detailed reports
Get explanations for LLM classifications to understand decision-making
Measure retrieval quality using ranking metrics like precision@k (see the sketch after this list)
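As a quick illustration of the precision@k idea, here is a minimal, self-contained sketch; the precision_at_k helper below is illustrative and not part of Phoenix's API.

# A minimal sketch of precision@k: the fraction of the top-k retrieved
# documents that are relevant. The helper name and inputs are illustrative,
# not a Phoenix API.
def precision_at_k(relevance_labels: list, k: int) -> float:
    top_k = relevance_labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

# Example: two of the top three retrieved documents are relevant.
print(precision_at_k([1, 1, 0, 0, 1], k=3))  # 0.667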
Notebook Walkthrough
This page walks through the key code snippets. To follow along end to end, check out the full notebook.
Download Benchmark Dataset
from phoenix.evals import download_benchmark_dataset

# Download a benchmark dataset of query-document pairs with ground-truth relevance labels.
df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train",
)
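You can preview the downloaded data to confirm it contains the query text, document text, and ground-truth relevance columns used below:

# Inspect the first few rows of queries, documents, and relevance labels.
df.head()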
Configure Evaluation
# Sample a subset of the benchmark and rename its columns to match the
# variables expected by the relevance prompt template.
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(columns={
    "query_text": "input",
    "document_text": "reference",
})
Run Relevance Classification
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Use a deterministic GPT-4 judge and constrain its outputs to the allowed labels (rails).
model = OpenAIModel(model="gpt-4", temperature=0.0)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
Evaluate Results
import matplotlib.pyplot as plt
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

# Map the boolean ground-truth labels onto the same rails used by the classifier.
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

# Per-class precision, recall, and F1 for the LLM judge against the ground truth.
print(classification_report(true_labels, relevance_classifications, labels=rails))

# Plot a normalized confusion matrix of actual vs. predicted relevance labels.
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
Get Explanations
# Re-run classification on a small sample, asking the LLM to explain each label.
relevance_classifications_df = llm_classify(
    dataframe=df_sample.sample(n=5),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    concurrency=20,
)
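You can then read each predicted label alongside its explanation. A brief sketch, assuming the returned DataFrame exposes label and explanation columns as in the notebook:

# Print each label together with the judge's reasoning.
for _, row in relevance_classifications_df.iterrows():
    print(f"{row['label']}: {row['explanation']}\n")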
Compare Models
Run the same evaluation with different models:
# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)
# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)
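To compare them, re-run the classification step with each model and score the predictions against the same ground-truth labels. A minimal sketch reusing df_sample, rails, true_labels, and the template from above; the evaluate_model helper is illustrative, not part of Phoenix:

from sklearn.metrics import accuracy_score

# Hypothetical helper: classify with a given judge model and return its accuracy
# against the ground-truth labels computed earlier.
def evaluate_model(judge_model):
    predictions = llm_classify(
        dataframe=df_sample,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=judge_model,
        rails=rails,
        concurrency=20,
    )["label"].tolist()
    return accuracy_score(true_labels, predictions)

for name, judge in [("gpt-3.5-turbo", model_gpt35), ("gpt-4-turbo-preview", model_gpt4turbo)]:
    print(name, evaluate_model(judge))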