Batch Evaluations

Dataframe Evaluation Methods

  • evaluate_dataframe for synchronous dataframe evaluations

  • async_evaluate_dataframe, an asynchronous version that runs evaluations concurrently and lets you specify a concurrency limit.

Both methods run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with added columns:

  1. {score_name}_score (one per score returned) contains the JSON-serialized score, or None if the evaluation failed.

  2. {evaluator_name}_execution_details (one per evaluator) contains the execution status, duration, and any exceptions that occurred (see the sketch below).
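
For example, the added columns can be inspected directly after a run; a minimal sketch, assuming the built-in exact_match evaluator was run and produced exact_match_score and exact_match_execution_details columns (these column names are assumptions following the pattern above):

import json

score_json = result["exact_match_score"].iloc[0]  # JSON string, or None if the evaluation failed
if score_json is not None:
    score = json.loads(score_json)  # e.g. a dict with fields such as the score name and value
    print(score.get("score"))

# Execution metadata for the evaluator (status, duration, exceptions)
print(result["exact_match_execution_details"].iloc[0])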

Notes:

  • If your dataframe columns don't match an evaluator's required fields, bind an input_mapping to it beforehand (see the second example below).

  • Failed evaluations: If an evaluation fails, the failure details are recorded in the execution_details column and the score is None (see the sketch below).
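
For example, failed rows can be separated out before analysis; a minimal sketch, assuming a hallucination evaluator produced hallucination_score and hallucination_execution_details columns (the column names are assumptions):

# Rows where the score is None correspond to failed evaluations.
failed = result[result["hallucination_score"].isna()]
for details in failed["hallucination_execution_details"]:
    print(details)  # inspect the status, duration, and any exception that was raised

succeeded = result[result["hallucination_score"].notna()]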

Examples

  1. An evaluator that returns more than one score:

import pandas as pd

from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()

  2. Running multiple evaluators, one bound with an input_mapping:

import pandas as pd

from phoenix.evals import bind_evaluator, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

df = pd.DataFrame(
    {
        # exact_match columns
        "output": ["Yes", "Yes", "No"], 
        "expected": ["Yes", "No", "No"], 
        # hallucination columns (need mapping)
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [ 
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()

  3. Asynchronous evaluation:

import pandas as pd

from phoenix.evals import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

df = pd.DataFrame(
    {
        "context": ["This is a test", "This is another test", "This is a third test"],
        "input": [ 
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "output": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(llm=llm)

result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
result.head()
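
The await call above assumes an active event loop (for example, a Jupyter notebook). In a plain Python script, a minimal sketch using asyncio.run with the same dataframe and evaluator would look like:

import asyncio

async def main():
    # Up to 5 evaluations run concurrently.
    return await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)

result = asyncio.run(main())
result.head()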

See Using Evals with Phoenix to learn how to run evals on project traces and upload them to Phoenix.
