Batch Evaluations
Dataframe Evaluation Methods
evaluate_dataframe: for synchronous dataframe evaluations
async_evaluate_dataframe: an asynchronous version optimized for speed, with the ability to specify concurrency
Both methods run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with two kinds of added columns:
{score_name}_score: contains the JSON serialized score (or None if the evaluation failed)
{evaluator_name}_execution_details: contains information about the execution status, duration, and any exceptions that occurred
A sketch showing how to inspect these columns follows the first example below.
Notes:
Bind input_mappings to your evaluators beforehand so they match your dataframe columns.
Failed evaluations: If an evaluation fails, the failure details are recorded in the execution_details column and the score will be None (a sketch for surfacing failed rows follows the second example below).
Examples
An evaluator that returns more than one score:
import pandas as pd
from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore
precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")
df = pd.DataFrame(
{
"output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
"expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
}
)
result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()
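Continuing from the example above, here is a minimal sketch for inspecting the added columns. It assumes the score columns hold JSON strings, per the description above, and discovers column names by their documented suffixes rather than hard-coding them:
import json

# Columns added by evaluate_dataframe: one "{score_name}_score" column per
# score and one "{evaluator_name}_execution_details" column per evaluator.
score_columns = [c for c in result.columns if c.endswith("_score")]
detail_columns = [c for c in result.columns if c.endswith("_execution_details")]
print(score_columns, detail_columns)

# A score is None when that row's evaluation failed, so guard before parsing
# the JSON string back into a Python object.
raw_score = result.iloc[0][score_columns[0]]
if raw_score is not None:
    print(json.loads(raw_score))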
Running multiple evaluators, one bound with an input_mapping:
import pandas as pd
from phoenix.evals import bind_evaluator, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match
df = pd.DataFrame(
{
# exact_match columns
"output": ["Yes", "Yes", "No"],
"expected": ["Yes", "No", "No"],
# hallucination columns (need mapping)
"context": ["This is a test", "This is another test", "This is a third test"],
"query": [
"What is the name of this test?",
"What is the name of this test?",
"What is the name of this test?",
],
"response": ["First test", "Another test", "Third test"],
}
)
llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = bind_evaluator(
HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)
result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()
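Continuing from this example, the sketch below surfaces failed rows. It relies only on the documented behavior that a failed evaluation leaves None in the score column and records details in the execution details column:
# Any None in a score column marks a failed evaluation for that row; the
# "_execution_details" columns record the status, duration, and exception.
score_columns = [c for c in result.columns if c.endswith("_score")]
detail_columns = [c for c in result.columns if c.endswith("_execution_details")]

failed_mask = result[score_columns].isna().any(axis=1)
if failed_mask.any():
    print(result.loc[failed_mask, detail_columns])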
Asynchronous evaluation:
import pandas as pd
from phoenix.evals import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator
df = pd.DataFrame(
{
"context": ["This is a test", "This is another test", "This is a third test"],
"input": [
"What is the name of this test?",
"What is the name of this test?",
"What is the name of this test?",
],
"output": ["First test", "Another test", "Third test"],
}
)
llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(llm=llm)
result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
result.head()
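The await call above assumes you are already in an async context, such as a notebook with top-level await. In a plain Python script, you can drive it with a small asyncio.run wrapper; this sketch continues from the example above:
import asyncio

async def main():
    # concurrency caps how many evaluations run at the same time
    result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
    print(result.head())

asyncio.run(main())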
See Using Evals with Phoenix to learn how to run evals on project traces and upload them to Phoenix.