Concepts: Retrieval

Benchmarking Retrieval

The advent of LLMs is prompting a rethinking of the retrieval system architectures that have been around for decades.

The core use case for RAG (Retrieval Augmented Generation) is connecting an LLM to private data: empowering the LLM to know your data and to respond based on the private data you fit into the context window.

As teams set up their retrieval systems, understanding performance and configuring the RAG parameters (type of retrieval, chunk size, and K) is currently a guessing game for most teams.

The picture above shows a typical retrieval architecture designed for RAG, consisting of a vector DB, an LLM, and an optional framework.

This section walks through a script that iterates through all possible parameterizations of a retrieval system and uses Evals to understand the trade-offs.

This overview runs through the scripts in Phoenix used for performance analysis of a RAG setup:

  • https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py

  • https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py

These scripts power the included notebook.

Retrieval Performance Analysis

In the typical retrieval flow, a user query is embedded and used to search a vector store for chunks of relevant data.

The core issue of retrieval performance: the chunks returned might or might not be able to answer your main question. They might be semantically similar to the query but not usable to answer it!
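
As a rough sketch of this flow (the embed function and in-memory "vector store" below are illustrative stand-ins, not a specific embedding model or vector DB API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function -- swap in your real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

# A toy "vector store": pre-embedded document chunks.
chunks = [
    "Phoenix supports LLM Evals for retrieval.",
    "The corpus is split into chunks before indexing.",
    "Queries are embedded and matched against chunk embeddings.",
]
chunk_vectors = np.stack([embed(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Embed the query and return the top-k most similar chunks by cosine similarity."""
    q = embed(query)
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("How are queries matched to chunks?"))
```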

The eval template is used to evaluate the relevance of each chunk of data. The Eval asks the main question: "Does the chunk of data contain relevant information to answer the question?"

The Retrieval Eval is used to analyze the performance of each chunk within the ordered list retrieved.

The Evals generated on each chunk can then be used to compute more traditional search and retrieval metrics for the retrieval system. We highly recommend that teams at least look at metrics such as:

  • MRR

  • Precision @ K

  • NDCG

These metrics have been used for years to help judge how well your search and retrieval system is returning the right documents to your context window.

These metrics can be computed overall, by cluster (UMAP), or on individual decisions, making them very powerful for tracking down problems from the simplest to the most complex.
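
As a rough sketch (not the Phoenix implementation), these metrics can be computed directly from the per-chunk relevance Evals by treating each Eval as a binary relevance label over the ranked list:

```python
import numpy as np

def mrr(relevance: list[int]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none are relevant)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance: list[int], k: int) -> float:
    """Normalized discounted cumulative gain over the top-k chunks."""
    gains = np.asarray(relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float((gains * discounts).sum())
    idcg = float((np.sort(gains)[::-1] * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Example: Evals for a single query's ranked chunks (1 = relevant, 0 = irrelevant)
evals = [0, 1, 1, 0]
print(mrr(evals), precision_at_k(evals, k=2), ndcg_at_k(evals, k=4))
```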

Retrieval Evals only give an idea of what and how much of the "right" data is fed into the context window of your RAG; they do not indicate whether the final answer was correct.

Q&A Evals

The Q&A Evals give a user an idea of whether the overall system answer was correct. This is typically what the system designer cares about most and is one of the most important metrics.

The Eval above shows how the query, chunks, and answer are used to create an overall assessment of the entire system. In production evaluations, this Q&A Eval is used to generate a percent-incorrect metric from the query, chunks, and answer.
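
A minimal sketch of that kind of Q&A Eval is shown below; the template wording, the llm callable, and the percent_incorrect helper are illustrative assumptions, not the exact Phoenix Q&A Eval:

```python
# Illustrative Q&A correctness Eval: `llm` is any callable that takes a prompt
# string and returns the model's text response.
QA_EVAL_TEMPLATE = """You are given a question, reference text (the retrieved chunks),
and an answer. Respond with a single word, "correct" or "incorrect", indicating
whether the answer correctly answers the question based on the reference text.

Question: {query}
Reference: {chunks}
Answer: {answer}
Label:"""

def classify_answer(query: str, chunks: list[str], answer: str, llm) -> str:
    """Ask an LLM judge whether the final answer is correct given the retrieved chunks."""
    prompt = QA_EVAL_TEMPLATE.format(query=query, chunks="\n".join(chunks), answer=answer)
    label = llm(prompt).strip().lower()
    return label if label in {"correct", "incorrect"} else "unparseable"

def percent_incorrect(records, llm) -> float:
    """Aggregate the Eval over (query, chunks, answer) records into a % incorrect."""
    labels = [classify_answer(q, c, a, llm) for q, c, a in records]
    return 100.0 * sum(label == "incorrect" for label in labels) / len(labels)
```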

Results

The results from the runs will be available in the experiment_data directory.

Underneath experiment_data there are two sets of metrics:

  • The first set of results removes the cases where there are 0 relevant documents retrieved. Some clients' test sets have a large number of questions that the documents cannot answer, which can skew the metrics significantly.

  • The second set of results is unfiltered and shows the raw metrics for every retrieval.

The picture above shows the results of benchmark sweeps across your retrieval system setup, measured with the Q&A Eval; the lower the percent incorrect, the better the results.

Retrieval Evals on Document Chunks

Retrieval Evals are designed to evaluate the effectiveness of retrieval systems, which typically return a list of chunks of length k ordered by relevance. The most common retrieval systems in the LLM ecosystem are vector DBs.

The Retrieval Eval is designed to assess the relevance of each chunk and its ability to answer the question. More information on the Retrieval Eval can be found here.

The picture above shows a single query returning chunks as a list. The Retrieval Eval runs across each chunk, returning a list of relevance values, one per chunk. Phoenix provides helper functions that take in a dataframe with a query column and a column of chunk lists, and produce a column of equal-length lists containing an Eval for each chunk.
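
A rough sketch of that shape of computation is below; evaluate_chunk_relevance is a hypothetical stand-in for the Phoenix Eval helpers, whose exact names and signatures vary across versions:

```python
import pandas as pd

def evaluate_chunk_relevance(query: str, chunk: str) -> str:
    """Hypothetical per-chunk relevance Eval -- replace with an LLM-judge call."""
    chunk_words = set(chunk.lower().split())
    return "relevant" if any(w.strip("?") in chunk_words for w in query.lower().split()) else "irrelevant"

df = pd.DataFrame(
    {
        "query": ["How do I configure chunk size?"],
        "retrieved_chunks": [[
            "Chunk size is configured when the corpus is indexed.",
            "Phoenix ships example notebooks for LangChain and LlamaIndex.",
        ]],
    }
)

# Produce a column whose lists have the same length as the retrieved chunk lists,
# with one Eval label per chunk.
df["chunk_evals"] = df.apply(
    lambda row: [evaluate_chunk_relevance(row["query"], chunk) for chunk in row["retrieved_chunks"]],
    axis=1,
)
print(df["chunk_evals"].iloc[0])  # ['relevant', 'irrelevant'] for this toy example
```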

Retrieval with Embeddings

Overview

Q&A with Retrieval at a Glance

LLM Input: User Query + retrieved document

LLM Output: Response based on query + document

Evaluation Metrics:

  1. Did the LLM answer the question correctly (correctness)?

  2. For each retrieved document, is the document relevant to answer the user query?
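
Put together, the end-to-end loop looks roughly like the sketch below (retrieve and llm are placeholders for your retriever and model, not a specific framework API):

```python
def answer_with_retrieval(query: str, retrieve, llm, k: int = 4) -> dict:
    """Retrieve the top-k documents for the query, then ask the LLM to answer using only them."""
    documents = retrieve(query, k=k)
    prompt = (
        "Answer the user's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(documents) + "\n\n"
        f"Question: {query}\nAnswer:"
    )
    response = llm(prompt)
    # Keep the query, retrieved documents, and response together so that both Evals
    # (per-document relevance and overall correctness) can be run afterwards.
    return {"query": query, "documents": documents, "response": response}
```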

Possibly the most common use case for an LLM application is connecting an LLM to proprietary data such as enterprise documents or video transcriptions. Such applications are often built on top of LLM frameworks such as LangChain or llama_index, which have first-class support for vector store retrievers. Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot that looks across a company's knowledge base/context to answer specific questions.

How to Evaluate Retrieval Systems

There are varying levels at which we can evaluate retrieval systems.

Step 1: First, we care whether the chatbot is correctly answering the user's questions. Are there certain types of questions the chatbot gets wrong more often?

Step 2: Once we know there's an issue, we need metrics to trace specifically where it went wrong. Is the issue with retrieval? Are the documents that the system retrieves irrelevant?

Step 3: If retrieval is not the issue, we should check whether we even have the right documents to answer the question.

| Question | Metric | Pros | Cons |
| --- | --- | --- | --- |
| Is this a bad response to the answer? | User feedback or LLM Eval for Q&A | Most relevant way to measure the application | Hard to trace down specifically what to fix |
| Is the retrieved context relevant? | LLM Eval for Relevance | Directly measures effectiveness of retrieval | Requires additional LLM calls |
| Is the knowledge base missing areas of user queries? | Query density (drift) - Phoenix generated | Highlights groups of queries with large distance from context | Identifies broad topics missing from the knowledge base, but not small gaps |

Using Phoenix Traces & Spans

Visualize the chain of the traces and spans for a Q&A chatbot use case. You can click into specific spans.

When clicking into the retrieval span, you can see the relevance score for each document. This can surface irrelevant context.

Using Phoenix Inferences to Analyze RAG (Retrieval Augmented Generation)

Step 1: Identifying Clusters of Bad Responses

Phoenix surfaces clusters of similar queries that have poor feedback.

Step 2: Irrelevant Documents Being Retrieved

Phoenix can help uncover when irrelevant context is being retrieved using the LLM Evals for Relevance. You can look at a cluster's aggregate relevance metrics (precision @ k, NDCG, MRR, etc.) to identify where to improve. You can also look at a single prompt/response pair and see the relevance of its documents.

Step 3: Don't Have Any Documents Close Enough

Phoenix can help you identify if there is context that is missing from your knowledge base. By visualizing query density, you can understand which topics you need to add additional documentation for in order to improve your chatbot's responses.

By setting the "primary" dataset to the user queries and the "corpus" dataset to the context in your vector store, you can see whether there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example below.
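
A rough sketch of that setup is below, assuming precomputed embeddings; the class and field names follow the Phoenix inferences API but may differ across Phoenix versions, so treat them as illustrative:

```python
import pandas as pd
import phoenix as px

# Illustrative dataframes: user queries and vector store documents, each with a
# precomputed embedding column (the column names here are assumptions).
queries_df = pd.DataFrame({"query": ["How do I reset my password?"],
                           "query_vector": [[0.1, 0.3, 0.5]]})
corpus_df = pd.DataFrame({"text": ["Billing FAQ ..."],
                          "text_vector": [[0.2, 0.1, 0.4]]})

queries_ds = px.Inferences(
    dataframe=queries_df,
    schema=px.Schema(
        prompt_column_names=px.EmbeddingColumnNames(
            vector_column_name="query_vector", raw_data_column_name="query"
        )
    ),
)
corpus_ds = px.Inferences(
    dataframe=corpus_df,
    schema=px.Schema(
        document_column_names=px.EmbeddingColumnNames(
            vector_column_name="text_vector", raw_data_column_name="text"
        )
    ),
)

# Launch Phoenix with queries as the primary dataset and the knowledge base as the
# corpus, so query clusters with no nearby context embeddings stand out.
px.launch_app(primary=queries_ds, corpus=corpus_ds)
```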

Troubleshooting Tip:

Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Ask ChatGPT to help you understand the makeup of the cluster. Try out the Colab here.

Looking for code to get started? Go to our Quickstart guide for Search and Retrieval.
