Retrieval (RAG) Relevance
When To Use RAG Eval Template
This eval determines whether a retrieved chunk contains an answer to the query. It is useful for measuring the quality of a retrieval system.
RAG Eval Template
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "irrelevant",
and should not contain any text or characters aside from that word.
"irrelevant" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.
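To make the two template variables concrete, here is a minimal sketch of how {query} and {reference} could be substituted into the prompt before it is sent to the model. The shortened TEMPLATE string and the render_prompt helper are hypothetical stand-ins for illustration only; they are not part of Phoenix, which performs this substitution itself inside llm_classify.

```python
# Hypothetical, shortened stand-in for the template shown above.
# Only the data section is reproduced; the instructions are omitted.
TEMPLATE = (
    "[BEGIN DATA]\n"
    "************\n"
    "[Question]: {query}\n"
    "************\n"
    "[Reference text]: {reference}\n"
    "[END DATA]\n"
)


def render_prompt(query: str, reference: str) -> str:
    """Fill the two template variables for a single (query, chunk) pair."""
    return TEMPLATE.format(query=query, reference=reference)


prompt = render_prompt(
    query="Who wrote Hamlet?",
    reference="Hamlet is a tragedy written by William Shakespeare.",
)
print(prompt)
```

Each row of the evaluated dataframe yields one such prompt, so the eval scores every retrieved chunk separately against its query.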
Benchmark Results
GPT-4 Results

GPT-3.5 Results

Claude V2 Results

How To Run the Eval
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails constrain the output to the specific values defined by the template.
# They strip stray text such as ",,," or "..." from the model's response
# and ensure that the binary value expected by the template is returned.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# df is the dataframe to evaluate; it is not defined in this snippet and must
# provide the columns referenced by the template variables (query, reference).
relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
)
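The code above runs the RAG relevancy template against each row of the dataframe df, which must contain the query and reference columns that the template's {query} and {reference} variables refer to.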
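The idea behind the rails can be sketched as follows. This is an illustration, not Phoenix's implementation: snap_to_rails is a hypothetical helper, the two rail values are assumed to be "relevant" and "irrelevant" as in the template, and returning None for unparseable output is an assumption made here. Note that "irrelevant" contains "relevant" as a substring, so the sketch matches whole words rather than substrings.

```python
import re
from typing import List, Optional


def snap_to_rails(raw_output: str, rails: List[str]) -> Optional[str]:
    """Return the one rail found as a whole word in raw_output, else None.

    Hypothetical helper for illustration; not the Phoenix implementation.
    """
    text = raw_output.strip().lower()
    matches = [
        rail
        for rail in rails
        # \b prevents "relevant" from matching inside "irrelevant".
        if re.search(r"\b" + re.escape(rail.lower()) + r"\b", text)
    ]
    # Accept only an unambiguous answer: exactly one rail present.
    return matches[0] if len(matches) == 1 else None


rails = ["relevant", "irrelevant"]  # assumed rail values, taken from the template
print(snap_to_rails("Relevant.", rails))
print(snap_to_rails("  irrelevant...", rails))
print(snap_to_rails("I am not sure", rails))
```

Requiring exactly one match means a response that mentions both labels is treated as unparseable instead of being silently assigned to one class.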
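| RAG Eval  | GPT-4 | GPT-3.5 | Palm (Text Bison) | Claude V2 |
|-----------|-------|---------|-------------------|-----------|
| Precision | 0.70  | 0.42    | 0.53              | 0.79      |
| Recall    | 0.88  | 1.0     | 1.0               | 0.22      |
| F1        | 0.78  | 0.59    | 0.69              | 0.34      |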
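For readers who want to reproduce scores like those in the table above on their own labeled data, the following is a minimal sketch of how precision, recall, and F1 are computed from ground-truth and predicted labels. It assumes "relevant" is the positive class; the function names and the toy labels are hypothetical and are not taken from the benchmark.

```python
from typing import List, Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(
    y_true: List[str], y_pred: List[str], positive: str = "relevant"
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the assumed positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy labels for illustration only.
y_true = ["relevant", "relevant", "irrelevant", "irrelevant"]
y_pred = ["relevant", "irrelevant", "relevant", "irrelevant"]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)

# Consistency check against the reported GPT-4 precision and recall.
print(round(f1_from(0.70, 0.88), 2))  # 0.78
```

The same check applied to the other columns reproduces the reported F1 values, which shows how a model with perfect recall but low precision, or high precision but low recall, still ends up with a modest F1.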