Batch Evaluations

Dataframe Evaluation Methods

  • evaluate_dataframe for synchronous dataframe evaluations

  • async_evaluate_dataframe, an asynchronous version that runs evaluations concurrently and lets you specify a concurrency limit.

Both methods run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with added columns:

  1. {score_name}_score (one per score returned) contains the JSON-serialized score, or None if the evaluation failed.

  2. {evaluator_name}_execution_details (one per evaluator) contains the execution status, duration, and any exceptions that occurred (see the sketch below).
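
For example, the added columns can be inspected directly after a run; a minimal sketch, assuming the built-in exact_match evaluator was run and produced exact_match_score and exact_match_execution_details columns (these column names are assumptions following the pattern above):

import json

score_json = result["exact_match_score"].iloc[0]  # JSON string, or None if the evaluation failed
if score_json is not None:
    score = json.loads(score_json)  # e.g. a dict with fields such as the score name and value
    print(score.get("score"))

# Execution metadata for the evaluator (status, duration, exceptions)
print(result["exact_match_execution_details"].iloc[0])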

Notes:

  • If your dataframe columns don't match an evaluator's required fields, bind an input_mapping to it beforehand (see the second example below).

  • Failed evaluations: If an evaluation fails, the failure details are recorded in the execution_details column and the score is None (see the sketch below).
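
For example, failed rows can be separated out before analysis; a minimal sketch, assuming a hallucination evaluator produced hallucination_score and hallucination_execution_details columns (the column names are assumptions):

# Rows where the score is None correspond to failed evaluations.
failed = result[result["hallucination_score"].isna()]
for details in failed["hallucination_execution_details"]:
    print(details)  # inspect the status, duration, and any exception that was raised

succeeded = result[result["hallucination_score"].notna()]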

Examples

  1. An evaluator that returns more than one score:

import pandas as pd

from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()

  2. Running multiple evaluators, one bound with an input_mapping:

import pandas as pd

from phoenix.evals import bind_evaluator, evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

df = pd.DataFrame(
    {
        # exact_match columns
        "output": ["Yes", "Yes", "No"], 
        "expected": ["Yes", "No", "No"], 
        # hallucination columns (need mapping)
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [ 
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()

  3. Asynchronous evaluation:

import pandas as pd

from phoenix.evals import async_evaluate_dataframe
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

df = pd.DataFrame(
    {
        "context": ["This is a test", "This is another test", "This is a third test"],
        "input": [ 
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "output": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(llm=llm)

result = await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)
result.head()
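
The await call above assumes an active event loop (for example, a Jupyter notebook). In a plain Python script, a minimal sketch using asyncio.run with the same dataframe and evaluator would look like:

import asyncio

async def main():
    # Up to 5 evaluations run concurrently.
    return await async_evaluate_dataframe(df, [hallucination_evaluator], concurrency=5)

result = asyncio.run(main())
result.head()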

See Using Evals with Phoenix to learn how to run evals on project traces and upload them to Phoenix.
