Pydantic Evals

In this tutorial, you will:

  1. Use Pydantic Evals to evaluate your LLM app on a simple question-answering task.

  2. Log your results to Arize to track your experiments and traces.

Install dependencies

!pip install -q pydantic-evals "arize[Tracing]" arize-otel openai openinference-instrumentation-openai

Set up API keys and imports

from openai import OpenAI
from pydantic_evals import Case, Dataset
from getpass import getpass
import os

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Set up Arize

Register a tracer provider with arize-otel, then enable auto-instrumentation for OpenAI so every LLM call in this tutorial is captured as a trace.

from arize.otel import register
tracer_provider = register(
    space_id=SPACE_ID,  
    api_key=API_KEY,
    project_name="pydantic-evals-tutorial",  
)

from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Define the Evaluation Dataset

Create a dataset of test cases for the question-answering task using Pydantic Evals.

  1. Each Case represents a single test with an input (question) and an expected output (answer).

  2. The Dataset aggregates these cases for evaluation.

cases = [
    Case(name="capital of France", inputs="What is the capital of France?", expected_output="Paris"),
    Case(name="author of Romeo and Juliet", inputs="Who wrote Romeo and Juliet?", expected_output="William Shakespeare"),
    Case(name="largest planet", inputs="What is the largest planet in our solar system?", expected_output="Jupiter")
]
dataset = Dataset(cases=cases)

Set up the LLM task to evaluate

client = OpenAI(api_key=OPENAI_API_KEY)

def evaluate_case(case):
    # Run the task: ask the model the question from this case
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case.inputs}],
    )
    output = response.choices[0].message.content
    print(output)
    # Score the output: mark it correct if the expected answer appears in the response
    is_correct = case.expected_output.lower() in output.strip().lower()
    return is_correct

Run your experiment and evaluation

# Run the task and score each case; every OpenAI call is traced to Arize
results = [evaluate_case(case) for case in dataset.cases]

for case, result in zip(dataset.cases, results):
    print(f"Case: {case.name}, Correct: {result}")

View results in Arize
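
Because the OpenAI client is auto-instrumented, each call made while running the cases is exported as a trace to the pydantic-evals-tutorial project in Arize. Open that project in the Arize UI to inspect the traced requests and responses alongside your evaluation results.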
