Pydantic Evals
Use Pydantic Evals to evaluate your LLM app on a simple question-answering task, and log the results to Arize to track your experiments and traces.
Install dependencies
!pip install -q pydantic-evals "arize[Tracing]" arize-otel openai openinference-instrumentation-openai
Set up API keys and imports
from openai import OpenAI
from pydantic_evals import Case, Dataset
from getpass import getpass
import os
SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
Set up Arize
Add our auto-instrumentation for OpenAI using arize-otel.
from arize.otel import register
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name="pydantic-evals-tutorial",
)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Define the Evaluation Dataset
Create a dataset of test cases for the question-answering task using Pydantic Evals.
Each Case represents a single test with an input (question) and an expected output (answer).
The Dataset aggregates these cases for evaluation.
cases = [
    Case(name="capital of France", inputs="What is the capital of France?", expected_output="Paris"),
    Case(name="author of Romeo and Juliet", inputs="Who wrote Romeo and Juliet?", expected_output="William Shakespeare"),
    Case(name="largest planet", inputs="What is the largest planet in our solar system?", expected_output="Jupiter"),
]
dataset = Dataset(cases=cases)
Set up the LLM task to evaluate
client = OpenAI(api_key=OPENAI_API_KEY)
def evaluate_case(case):
    # Send the case's question to the model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case.inputs}],
    )
    output = response.choices[0].message.content
    print(output)
    # Mark the case correct if the expected answer appears anywhere in the response
    is_correct = case.expected_output.lower() in output.strip().lower()
    return is_correct
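As a quick sanity check, you can run the task on a single case before evaluating the whole dataset. This just reuses the evaluate_case function above and assumes the model answers the first question with "Paris".
# Sanity check on one case: prints the model's answer, then True/False
print(evaluate_case(dataset.cases[0]))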
Run your experiment and evaluation
results = [evaluate_case(case) for case in dataset.cases]
for case, result in zip(dataset.cases, results):
    print(f"Case: {case.name}, Correct: {result}")
View results in Arize
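Each OpenAI call made by evaluate_case is captured by the OpenAI instrumentor, so after the cells above run, the resulting traces should appear under the pydantic-evals-tutorial project in your Arize space, where you can inspect the inputs, outputs, and latency for every case.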
