1 of 16

Pre-Built Evals

The following are simple functions on top of the LLM evals building blocks that are pre-tested with benchmark data.

All evals templates are tested against golden data that are available as part of the LLM eval library's benchmarked data and target precision at 70-90% and F1 at 70-85%.

Supported Models.

The models are instantiated and usable in the LLM Eval function. The models are also directly callable with strings.

model = OpenAIModel(model_name="gpt-4",temperature=0.6)
model("What is the largest costal city in France?")

We currently support a growing set of models for LLM Evals, please check out the Eval Models section for usage.

Hallucinations

When To Use Hallucination Eval Template

This LLM Eval detects if the output of a model is a hallucination based on contextual data.

This Eval is specifically designed to detect hallucinations in generated answers from private or retrieved data. The Eval detects if an AI answer to a question is a hallucination based on the reference data used to generate the answer.

This Eval is designed to check for hallucinations on private data, specifically on data that is fed into the context window from retrieval.

It is not designed to check hallucinations on what the LLM was trained on. It is not useful for random public fact hallucinations. E.g. "What was Michael Jordan's birthday?"

Hallucination Eval Template

In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.

    # Query: {query}
    # Reference text: {reference}
    # Answer: {response}
    Is the answer above factual or hallucinated based on the query and reference text?

We are continually iterating our templates, view the most up-to-date template on GitHub.

How To Run the Hallucination Eval

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails are used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned 
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
hallucination_classifications = llm_classify(
    dataframe=df, 
    template=HALLUCINATION_PROMPT_TEMPLATE, 
    model=model, 
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

The above Eval shows how to the the hallucination template for Eval detection.

Benchmark Results

This benchmark was obtained using notebook below. It was run using the HaluEval QA Dataset as a ground truth dataset. Each example in the dataset was evaluating using the HALLUCINATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the is_hallucination label in the HaluEval dataset to generate the confusion matrices below.

GPT-4 Results

Eval

GPT-4

Precision

0.93

Recall

0.72

0.82

Throughput

GPT-4

100 Samples

105 sec

Q&A on Retrieved Data

When To Use Q&A Eval Template

This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals that are checks on chunks of data returned, this check is a system level check of a correct Q&A.

question: This is the question the Q&A system is running against
sampled_answer: This is the answer from the Q&A system.
context: This is the context to be used to answer the question, and is what Q&A Eval must use to check the correct answer

Q&A Eval Template

We are continually iterating our templates, view the most up-to-date template .

How To Run the Q&A Eval

The above Eval uses the QA template for Q&A analysis on retrieved data.

Benchmark Results

The used was created based on:

Squad 2: The 2.0 version of the large-scale dataset Stanford Question Answering Dataset (SQuAD 2.0) allows researchers to design AI models for reading comprehension tasks under challenging constraints.
Supplemental Data to Squad 2: In order to check the case of detecting incorrect answers, we created wrong answers based on the context data. The wrong answers are intermixed with right answers.

Each example in the dataset was evaluating using the QA_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth in the benchmarking dataset.

GPT-4 Results

Q&A Eval

GPT-4o

GPT-4

Throughput

GPT-4

Retrieval (RAG) Relevance

When To Use RAG Eval Template

This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.

RAG Eval Template

We are continually iterating our templates, view the most up-to-date template .

How To Run the RAG Relevance Eval

The above runs the RAG relevancy LLM template against the dataframe df.

Benchmark Results

This benchmark was obtained using notebook below. It was run using the as a ground truth dataset. Each example in the dataset was evaluating using the RAG_RELEVANCY_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth label in the WikiQA dataset to generate the confusion matrices below.

GPT-4 Result

RAG Eval

GPT-4o

GPT-4

Throughput

GPT-4

Summarization

When To Use Summarization Eval Template

This Eval helps evaluate the summarization results of a summarization task. The template variables are:

document: The document text to summarize
summary: The summary of the document

Summarization Eval Template

You are comparing the summary text and it's original document and trying to determine
if the summary is good. Here is the data:
    [BEGIN DATA]
    ************
    [Summary]: {output}
    ************
    [Original Document]: {input}
    [END DATA]
Compare the Summary above to the Original Document and determine if the Summary is
comprehensive, concise, coherent, and independent relative to the Original Document.
Your response must be a single word, either "good" or "bad", and should not contain any text
or characters aside from that. "bad" means that the Summary is not comprehensive,
concise, coherent, and independent relative to the Original Document. "good" means the
Summary is comprehensive, concise, coherent, and independent relative to the Original Document.

We are continually iterating our templates, view the most up-to-date template on GitHub.

How To Run the Summarization Eval

import phoenix.evals.default_templates as templates
from phoenix.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails are used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned 
rails = list(templates.SUMMARIZATION_PROMPT_RAILS_MAP.values())
summarization_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.SUMMARIZATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

The above shows how to use the summarization Eval template.

Benchmark Results

This benchmark was obtained using notebook below. It was run using a Daily Mail CNN summarization dataset as a ground truth dataset. Each example in the dataset was evaluating using the SUMMARIZATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth label in the summarization dataset to generate the confusion matrices below.

GPT-4 Results

Eval

GPT-4o

GPT-4

Precision

0.87

0.79

Recall

0.63

0.88

0.73

0.83

Code Generation

When To Use Code Generation Eval Template

This Eval checks the correctness and readability of the code from a code generation process. The template variables are:

query: The query is the coding question being asked
code: The code is the code that was returned.

Code Generation Eval Template

You are a stern but practical senior software engineer who cares a lot about simplicity and
readability of code. Can you review the following code that was written by another engineer?
Focus on readability of the code. Respond with "readable" if you think the code is readable,
or "unreadable" if the code is unreadable or needlessly complex for what it's trying
to accomplish.

ONLY respond with "readable" or "unreadable"

Task Assignment:
```
{query}
```

Implementation to Evaluate:
```
{code}
```

We are continually iterating our templates, view the most up-to-date template on GitHub.

How To Run the Code Generation Eval

from phoenix.evals import (
    CODE_READABILITY_PROMPT_RAILS_MAP,
    CODE_READABILITY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails are used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned 
rails = list(CODE_READABILITY_PROMPT_RAILS_MAP.values())
readability_classifications = llm_classify(
    dataframe=df,
    template=CODE_READABILITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

The above shows how to use the code readability template.

Benchmark Results

This benchmark was obtained using notebook below. It was run using an OpenAI Human Eval dataset as a ground truth dataset. Each example in the dataset was evaluating using the CODE_READABILITY_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth label in the benchmark dataset to generate the confusion matrices below.

GPT-4 Results

Code Eval

GPT-4

Precision

0.93

Recall

0.78

0.85

Toxicity

When To Use Toxicity Eval Template

The following shows the results of the toxicity Eval on a toxic dataset test to identify if the AI response is racist, biased, or toxic. The template variables are:

text: the text to be classified

Toxicity Eval Template

We are continually iterating our templates, view the most up-to-date template .

How To Run the Toxicity Eval

Benchmark Results

This benchmark was obtained using notebook below. It was run using the as a ground truth dataset. Each example in the dataset was evaluating using the TOXICITY_PROMPT_TEMPLATE above, then the resulting labels were compared against the ground truth label in the benchmark dataset to generate the confusion matrices below.

GPT-4 Results

Note: Palm is not useful for Toxicity detection as it always returns "" string for toxic inputs

Toxicity Eval

GPT-4o

GPT-4

AI vs Human (Groundtruth)

This LLM evaluation is used to compare AI answers to Human answers. Its very useful in RAG system benchmarking to compare the human generated groundtruth.

A workflow we see for high quality RAG deployments is generating a golden dataset of questions and a high quality set of answers. These can be in the range of 100-200 but provide a strong check for the AI generated answers. This Eval checks that the human ground truth matches the AI generated answer. Its designed to catch missing data in "half" answers and differences of substance.

Example Human vs AI on Arize Docs:

Question:

What Evals are supported for LLMs on generative models?

Human:

Arize supports a suite of Evals available from the Phoenix Evals library, they include both pre-tested Evals and the ability to configure cusotm Evals. Some of the pre-tested LLM Evals are listed below:

Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Code Execution, Hallucination Detection and Summarizaiton

AI:

Arize supports LLM Evals.

Eval:

Incorrect

Explanation of Eval:

The AI answer is very brief and lacks the specific details that are present in the human ground truth answer. While the AI answer is not incorrect in stating that Arize supports LLM Evals, it fails to mention the specific types of Evals that are supported, such as Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Hallucination Detection, and Summarization. Therefore, the AI answer does not fully capture the substance of the human answer.

Overview of template:

We are continually iterating our templates, view the most up-to-date template on GitHub.

print(HUMAN_VS_AI_PROMPT_TEMPLATE)

You are comparing a human ground truth answer from an expert to an answer from an AI model.
Your goal is to determine if the AI answer correctly matches, in substance, the human answer.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Human Ground Truth Answer]: {correct_answer}
    ************
    [AI Answer]: {ai_generated_answer}
    ************
    [END DATA]
Compare the AI answer to the human ground truth answer, if the AI correctly answers the question,
then the AI answer is "correct". If the AI answer is longer but contains the main idea of the
Human answer please answer "correct". If the AI answer diverges or does not contain the main
idea of the human answer, please answer "incorrect".

How to run the Human vs AI Eval:

from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails are used to hold the output to specific values based on the template
# It will remove text such as ",,," or "..."
# Will ensure the binary value expected from the template is returned
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    verbose=False,
    provide_explanation=True
)

Benchmark Results:

The follow benchmarking data was gathered by comparing various model results to ground truth data. The ground truth data used was a handcrafted dataset consisting of questions about the Arize platform. That dataset is availabe here.

GPT-4 Results

GPT-4o

GPT-4

Precision

0.90

0.92

Recall

0.56

0.74

0.69

0.82

Reference (citation) Link

Reference Links in Retrieval Q&A

In chatbots and Q&A systems, many times reference links are provided in the response, along with an answer, to help point users to documentation or pages that contain more information or the source for the answer.

EXAMPLE: Q&A from Arize-Phoenix Documentation

QUESTION: What other models does Arize Phoenix support beyond OpenAI for running Evals?

ANSWER: Phoenix does support a large set of LLM models through the model object. Phoenix supports OpenAI (GPT-4, GPT-4-32k, GPT-3.5 Turbo, GPT-3.5 Instruct, etc...), Azure OpenAI, Google Palm2 Text Bison, and All AWS Bedrock models (Claude, Mistral, etc...).

REFERENCE LINK: https://arize.com/docs/phoenix/api/evaluation-models

This Eval checks the reference link returned answers the question asked in a conversation

We are continually iterating our templates, view the most up-to-date template on GitHub.

print(REF_LINK_EVAL_PROMPT_TEMPLATE_STR)

You are given a conversation that contains questions by a CUSTOMER and you are trying
to determine if the documentation page shared by the ASSISTANT correctly answers
the CUSTOMERS questions. We will give you the conversation between the customer
and the ASSISTANT and the text of the documentation returned:
    [CONVERSATION AND QUESTION]:
    {conversation}
    ************
    [DOCUMENTATION URL TEXT]:
    {document_text}
    [DOCUMENTATION URL TEXT]:
You should respond "correct" if the documentation text answers the question the
CUSTOMER had in the conversation. If the documentation roughly answers the question
even in a general way the please answer "correct". If there are multiple questions and a single
question is answered, please still answer "correct". If the text does not answer the
question in the conversation, or doesn't contain information that would allow you
to answer the specific question please answer "incorrect".

How to run the Citation Eval

from phoenix.evals import (
    REF_LINK_EVAL_PROMPT_RAILS_MAP,
    REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails are used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned
rails = list(REF_LINK_EVAL_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

Benchmark Results

This benchmark was obtained using notebook below. It was run using a handcrafted ground truth dataset consisting of questions on the Arize platform. That dataset is available here.

Each example in the dataset was evaluating using the REF_LINK_EVAL_PROMPT_TEMPLATE_STR above, then the resulting labels were compared against the ground truth label in the benchmark dataset to generate the confusion matrices below.

GPT-4 Results

Reference Link Evals

GPT-4o

Precision

0.96

Recall

0.79

0.87

User Frustration

Teams that are using conversation bots and assistants desire to know whether a user interacting with the bot is frustrated. The user frustration evaluation can be used on a single back and forth or an entire span to detect whether a user has become frustrated by the conversation.

User Frustration Eval Template

We are continually iterating our templates, view the most up-to-date template .

The following is an example of code snippet showing how to use the eval above template:

SQL Generation Eval

SQL Generation is a common approach to using an LLM. In many cases the goal is to take a human description of the query and generate matching SQL to the human description.

Example of a Question: How many artists have names longer than 10 characters?

Example Query Generated:

SELECT COUNT(ArtistId) \nFROM artists \nWHERE LENGTH(Name) > 10

The goal of the SQL generation Evaluation is to determine if the SQL generated is correct based on the question asked.

SQL Eval Template

We are continually iterating our templates, view the most up-to-date template .

Running an SQL Generation Eval

Agent Function Calling Eval

The Agent Function Call eval can be used to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

Function Calling Eval Template

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""

We are continually iterating our templates, view the most up-to-date template on GitHub.

Running an Agent Eval using the Function Calling Template

from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# the rails object will be used to snap responses to "correct" 
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row 
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Parameters:

df - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:
- question - the query made to the model. If you've to evaluate, this will the llm.input_messages column in your exported data.
- tool_call - information on the tool called and parameters included. If you've exported spans from Phoenix to evaluate, this will be the llm.function_call column in your exported data.

Parameter Extraction Only

This template instead evaluates only the parameter extraction step of a router:

You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.

"correct" means the function call parameters match the JSON below and provides only relevant information.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question. You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"not-applicable" means that response was not a function call.

Here is more information on each function:
{function_defintions}

Agent Path Convergence

When your agents take multiple steps to get to an answer or resolution, it's important to evaluate the pathway it took to get there. You want most of your runs to be consistent and not take unnecessarily frivolous or wrong actions.

One way of doing this is to calculate convergence:

Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score: avg(minimum steps taken / steps taken for this run)

This will give a convergence score of 0-1, with 1 being a perfect score.

Agent Planning

This template evaluates a plan generated by an agent. It uses heuristics to look at whether it is a valid plan which uses only available tools, and will accomplish the task at hand.

Prompt Template

You are an evaluation assistant. Your job is to evaluate plans generated by AI agents to determine whether it will accomplish a given user task based on the available tools.

Here is the data:
    [BEGIN DATA]
    ************
    [User task]: {task}
    ************
    [Tools]: {tool_definitions}
    ************
    [Plan]: {plan}
    [END DATA]

Here is the criteria for evaluation
1. Does the plan include only valid and applicable tools for the task?  
2. Are the tools used in the plan sufficient to accomplish the task?  
3. Will the plan, as outlined, successfully achieve the desired outcome?  
4. Is this the shortest and most efficient plan to accomplish the task?

Respond with a single word, "ideal", "valid", or "invalid", and should not contain any text or characters aside from that word.

"ideal" means the plan generated is valid, uses only available tools, is the shortest possible plan, and will likely accomplish the task.

"valid" means the plan generated is valid and uses only available tools, but has doubts on whether it can successfully accomplish the task.

"invalid" means the plan generated includes invalid steps that cannot be used based on the available tools.

Agent Reflection

Use this prompt template to evaluate an agent's final response. This is an optional step, which you can use as a gate to retry a set of actions if the response or state of the world is insufficient for the given task.

Prompt Template

This prompt template is heavily inspired by the paper: "Self Reflection in LLM Agents".

You are an expert in {topic}. I will give you a user query. Your task is to reflect on your provided solution and whether it has solved the problem.
First, explain whether you believe the solution is correct or incorrect.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, create a list of detailed instructions to help you correctly solve this problem in the future if it is incorrect.

Be concise in your response; however, capture all of the essential information.

Here is the data:
    [BEGIN DATA]
    ************
    [User Query]: {user_query}
    ************
    [Tools]: {tool_definitions}
    ************
    [State]: {current_state}
    ************
    [Provided Solution]: {solution}
    [END DATA]

Audio Emotion Detection

The Emotion Detection Eval Template is designed to classify emotions from audio files. This evaluation leverages predefined characteristics, such as tone, pitch, and intensity, to detect the most dominant emotion expressed in an audio input. This guide will walk you through how to use the template within the Phoenix framework to evaluate emotion classification models effectively.