Use Cases
Chapter Summary
The final chapter highlights specific evaluation use cases, including agent evaluations and Retrieval-Augmented Generation (RAG) systems. Practical examples illustrate how to assess tool usage, retrieval accuracy, and response appropriateness, ensuring holistic evaluation coverage.
Evaluate agents and RAG systems effectively with best practices outlined in our product documentation.
Evaluating Agents
Building evaluators for each step
Evaluating the skill steps of an agent is similar to evaluating those skills outside of the agent. If your agent has a RAG skill, for example, you would still evaluate both the retrieval and response generation steps, calculating metrics like document relevance and hallucinations in the response.
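As a rough illustration, a per-step evaluation harness for a RAG skill might look like the sketch below. The keyword-overlap checks are deliberately trivial placeholders standing in for the LLM-as-judge evaluators shown later in this chapter, and the step dictionary shape is hypothetical:

# Minimal sketch: per-step evaluators for an agent's RAG skill.
# The relevance and groundedness checks are trivial keyword-overlap
# placeholders for LLM-as-judge evaluators; the per-step structure is the
# point. The step dict shape is hypothetical.

def doc_relevance(query: str, doc: str) -> str:
    # Placeholder: any token overlap between query and document counts as relevant.
    return "relevant" if set(query.lower().split()) & set(doc.lower().split()) else "unrelated"

def response_groundedness(context: str, response: str) -> str:
    # Placeholder: every response token must appear in the retrieved context.
    return "factual" if set(response.lower().split()) <= set(context.lower().split()) else "hallucinated"

def evaluate_rag_skill(step: dict) -> dict:
    """Evaluate the retrieval and generation sub-steps of one RAG skill call."""
    context = " ".join(step["documents"])
    return {
        "document_relevance": [doc_relevance(step["query"], d) for d in step["documents"]],
        "response_groundedness": response_groundedness(context, step["response"]),
    }

step = {
    "query": "Within how many days can electronics be returned?",
    "documents": ["Electronics may be returned within 30 days of purchase."],
    "response": "Electronics may be returned within 30 days of purchase.",
}
print(evaluate_rag_skill(step))
# {'document_relevance': ['relevant'], 'response_groundedness': 'factual'}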
Unique considerations when evaluating agents
Beyond individual skills, agent evaluation introduces some unique considerations.
In addition to evaluating the agent’s skills, you need to evaluate the router and the path the agent takes.
The router should be evaluated on two axes: first, its ability to choose the right skill or function for a given input; second, its ability to extract the right parameters from the input to populate the function call.
Choosing the right skill is perhaps the most important task and one of the most difficult. This is where your router prompt (if you have one) will be put to the test. Low scores at this stage usually stem from a poor router prompt or unclear function descriptions, both of which are challenging to improve.
Extracting the right parameters is also tricky, especially when parameters overlap. Consider adding some curveballs into your test cases, like a user asking for an order status while providing a shipping tracking number, to stress-test your agent.
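A hedged sketch of what such test cases might look like follows; run_router is a hypothetical hook into your agent's routing step, and the tool names and parameters are illustrative only:

# Minimal sketch: router test cases, including a curveball where the user asks
# about an order but supplies a shipping tracking number instead of an order ID.
# run_router is a hypothetical hook that returns (tool_name, parameters)
# selected by your agent's router for a given input.

router_test_cases = [
    {
        "input": "Where is my order #A1234?",
        "expected_tool": "get_order_status",
        "expected_params": {"order_id": "A1234"},
    },
    {
        # Curveball: order-status question, but only a tracking number is given.
        "input": "Can you check on my order? Tracking number is 1Z999AA10123456784.",
        "expected_tool": "get_order_status",
        "expected_params": {"tracking_number": "1Z999AA10123456784"},
    },
]

for case in router_test_cases:
    tool, params = run_router(case["input"])  # hypothetical router call
    print(
        case["input"],
        "| tool correct:", tool == case["expected_tool"],
        "| params correct:", params == case["expected_params"],
    )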
Arize provides built-in evaluators to measure tool call accuracy using an LLM as a judge, which can assist at this stage.
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of tools provided below. It is your job to decide whether that agent chose the right tool to call.
[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called]: {tool_call}
[END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.
"correct" means the correct tool call was chosen, the correct parameters were extracted from the question, the tool call generated is runnable and correct, and that no outside information not present in the question was used in the generated question.
[Tool Definitions]: {tool_definitions}
"""
Lastly, evaluate the path the agent takes during execution. Does it repeat steps? Get stuck in loops? Return to the router unnecessarily? These “path errors” can cause the worst bugs in agents.
To evaluate the path, we recommend adding an iteration counter as an evaluation. Tracking the number of steps it takes for the agent to complete different types of queries can provide a useful statistic.
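A minimal sketch of such a step-count evaluation, assuming traces are exported as lists of named steps (the trace shape here is hypothetical and will differ by observability platform):

# Minimal sketch: iteration counts and router revisits per query category.
# The trace format is a hypothetical simplification of an exported trace.
from collections import defaultdict
from statistics import mean

def steps_per_category(traces: list[dict]) -> dict[str, float]:
    """Average number of steps the agent took, grouped by query category."""
    counts = defaultdict(list)
    for trace in traces:
        counts[trace["category"]].append(len(trace["steps"]))
    return {category: mean(n) for category, n in counts.items()}

def router_revisits(trace: dict) -> int:
    """How many times the agent returned to the router after its first call."""
    return max(0, sum(1 for step in trace["steps"] if step == "router") - 1)

traces = [
    {"category": "order_status", "steps": ["router", "lookup_order", "generate"]},
    {"category": "order_status", "steps": ["router", "lookup_order", "router", "lookup_order", "generate"]},
]
print(steps_per_category(traces))            # {'order_status': 4}
print([router_revisits(t) for t in traces])  # [0, 1]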
However, the best way to debug agent paths is by manually inspecting traces. Especially early in development, using an observability platform and manually reviewing agent executions will provide valuable insights for improvement.
Retrieval-Augmented Generation (RAG) Evaluation
RAG applications need to be evaluated on two critical aspects.
1. Retrieval Evaluation: assesses the accuracy and relevance of the documents that were retrieved. Examples:

Metric | Description | Output
---|---|---
Groundedness or Faithfulness | The extent to which the LLM's response aligns with the retrieved context. | Binary classification (faithful/unfaithful)
Context Relevance | Gauges how well the retrieved context supports the user's query. | Binary classification (relevant/irrelevant). Ranking metrics: Mean Reciprocal Rank (MRR), Precision @ K, Mean Average Precision (MAP), Hit rate, Normalized Discounted Cumulative Gain (NDCG)

The following prompt template can be used as an LLM-as-judge evaluator for context relevance:
You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text contains information that can answer the Question. Please focus on whether the very specific question can be answered by the information in the Reference text.
Your response must be a single word, either "relevant" or "unrelated", and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.
2. Response Evaluation: measures the appropriateness of the response generated by the system given the retrieved context. Examples:

Metric | Description | Output
---|---|---
Ground Truth-Based Metrics | Compares the LLM's response against a ground-truth (reference) answer. | Accuracy, Precision, Recall, F1 score
Answer Relevance | Gauges how relevant the generated response is to the user's query. | Binary classification (Relevant/Irrelevant)
QA Correctness | Detects whether a question was correctly answered by the system based on the retrieved data. | Binary classification (Correct/Incorrect)
Hallucinations | Detects LLM hallucinations relative to the retrieved context. | Binary classification (Factual/Hallucinated)

The following prompt template can be used as an LLM-as-judge evaluator for QA correctness:
You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data:
[BEGIN DATA]
************
[Question]: {question}
************
[Reference]: {context}
************
[Answer]: {sampled_answer}
[END DATA]
Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
Join the Arize community and continue your journey into LLM evaluation.