Evaluate the relevance of documents retrieved by RAG applications using Phoenix's evaluation framework.
This tutorial shows how to classify documents as relevant or irrelevant to queries using benchmark datasets with ground-truth labels.
Key Points:
Download and prepare benchmark datasets for relevance classification
Compare different LLM models (GPT-4, GPT-3.5, GPT-4 Turbo) for classification accuracy
Analyze results with confusion matrices and detailed reports
Get explanations for LLM classifications to understand decision-making
Measure retrieval quality using ranking metrics like precision@k
We will go through key code snippets on this page. To follow the full tutorial, check out the full notebook.
df = download_benchmark_dataset(
task="binary-relevance-classification",
dataset_name="wiki_qa-train"
)
N_EVAL_SAMPLE_SIZE = 100
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(columns={
"query_text": "input",
"document_text": "reference",
})
model = OpenAIModel(model="gpt-4", temperature=0.0)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
dataframe=df_sample,
template=RAG_RELEVANCY_PROMPT_TEMPLATE,
model=model,
rails=rails,
concurrency=20,
)["label"].tolist()
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()
print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
cmap=plt.colormaps["Blues"],
number_label=True,
normalized=True,
)
relevance_classifications_df = llm_classify(
dataframe=df_sample.sample(n=5),
template=RAG_RELEVANCY_PROMPT_TEMPLATE,
model=model,
rails=rails,
provide_explanation=True,
concurrency=20,
)
Run the same evaluation with different models:
# GPT-3.5
model_gpt35 = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0)
# GPT-4 Turbo
model_gpt4turbo = OpenAIModel(model="gpt-4-turbo-preview", temperature=0.0)