Pydantic AI Evals

How to use Pydantic Evals with Phoenix to run structured evaluations of your AI applications

Pydantic Evals is an evaluation library that provides both preset, deterministic evaluators and LLM-as-a-judge evaluators. It runs evaluations over datasets of test cases defined with Pydantic models. This guide shows you how to use Pydantic Evals alongside Arize Phoenix to run evaluations on traces captured from your running application.

Launch Phoenix

Sign up for Phoenix:

Sign up for an Arize Phoenix account at https://app.phoenix.arize.com/login

Install packages:

pip install arize-phoenix-otel

Set your Phoenix endpoint and API Key:

import os

# Add Phoenix API Key for tracing
PHOENIX_API_KEY = "ADD YOUR API KEY"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

Your Phoenix API key can be found in the Keys section of your dashboard.

Install

pip install pydantic-evals arize-phoenix openai openinference-instrumentation-openai

Setup

Enable Phoenix tracing to capture traces from your application:
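A minimal setup sketch: register Phoenix as the OpenTelemetry tracer provider and let it auto-instrument the OpenAI client via OpenInference. The project name below is an arbitrary choice for this guide.

from phoenix.otel import register

# Register Phoenix as the trace provider; auto_instrument=True activates any
# installed OpenInference instrumentors (here, the OpenAI instrumentation).
tracer_provider = register(
    project_name="pydantic-evals-tutorial",  # hypothetical project name
    auto_instrument=True,
)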

Basic Usage

1. Generate Traces to Evaluate

First, create some example traces by running your AI application. Here's a simple example:
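For instance, a short script that answers a couple of questions with the OpenAI Chat Completions API; because the OpenAI instrumentation is active, each call is recorded as a trace in Phoenix. The model and questions are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
]

# Each completion call below is captured as an LLM span in Phoenix.
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user", "content": question}],
    )
    print(response.choices[0].message.content)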

2. Export Traces from Phoenix

Export the traces you want to evaluate:
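One way to do this, assuming the project name used above, is to pull the project's spans into a pandas DataFrame with the Phoenix client. Column names such as "attributes.input.value" and "attributes.output.value" follow Phoenix's span-attribute flattening and hold the prompt and completion text for LLM spans.

import phoenix as px

px_client = px.Client()

# Export spans for the project as a DataFrame, indexed by span ID.
spans_df = px_client.get_spans_dataframe(project_name="pydantic-evals-tutorial")
print(spans_df.head())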

3. Define Evaluation Dataset

Create a dataset of test cases using Pydantic Evals:
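A sketch using Pydantic Evals' Case and Dataset types. The inputs and expected outputs below mirror the example questions traced above; replace them with your own test cases.

from pydantic_evals import Case, Dataset

cases = [
    Case(
        name="capital_of_france",
        inputs="What is the capital of France?",
        expected_output="Paris",
    ),
    Case(
        name="romeo_and_juliet",
        inputs="Who wrote Romeo and Juliet?",
        expected_output="William Shakespeare",
    ),
]

dataset = Dataset(cases=cases)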

4. Create Custom Evaluators

Define evaluators to assess your model's performance:
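A minimal custom evaluator, written as a dataclass subclass of Evaluator. This one only checks whether the expected answer appears in the model output; it stands in for whatever domain-specific checks you need.

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ContainsExpected(Evaluator):
    """Score 1.0 if the expected output appears in the actual output."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        if ctx.output is None or ctx.expected_output is None:
            return 0.0
        return 1.0 if str(ctx.expected_output).lower() in str(ctx.output).lower() else 0.0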

5. Setup Task and Dataset

Create a task that retrieves outputs from your traced data:
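One possible task, assuming the spans_df exported above: instead of calling the model again, it looks up the traced output for a given input. The column names follow Phoenix's span-attribute naming and may need adjusting for your data. The dataset is rebuilt here with the custom evaluator attached.

async def lookup_traced_output(question: str) -> str:
    """Return the traced output of the span whose input contains the question."""
    matches = spans_df[
        spans_df["attributes.input.value"]
        .astype(str)
        .str.contains(question, regex=False, na=False)
    ]
    if matches.empty:
        return ""
    return str(matches.iloc[0]["attributes.output.value"])


# Attach the custom evaluator so it runs against every case.
dataset = Dataset(cases=cases, evaluators=[ContainsExpected()])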

6. Add LLM Judge Evaluator

For more sophisticated evaluation, add an LLM judge:
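Pydantic Evals ships an LLMJudge evaluator that grades each output against a natural-language rubric. The rubric and model name below are illustrative.

from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    rubric="The output answers the question accurately and concisely.",
    model="openai:gpt-4o",  # any model supported by Pydantic AI
    include_input=True,     # let the judge see the case inputs as well
)

dataset = Dataset(cases=cases, evaluators=[ContainsExpected(), judge])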

7. Run Evaluation

Execute the evaluation:
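Running the dataset against the task produces a report that you can print or inspect programmatically:

report = dataset.evaluate_sync(lookup_traced_output)

# Print a per-case table of inputs, outputs, and evaluator scores.
report.print(include_input=True, include_output=True)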

Advanced Usage

Upload Results to Phoenix

Upload your evaluation results back to Phoenix for visualization:
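One way to do this, sketched under two assumptions: that report.cases exposes each case's inputs and scores (attribute names may differ slightly across pydantic-evals versions), and that spans_df from the export step is indexed by span ID. The scores are logged as a SpanEvaluations set that Phoenix attaches to the matching spans.

import pandas as pd

import phoenix as px
from phoenix.trace import SpanEvaluations

eval_rows = []
for case in report.cases:
    # Hypothetical matching: find the span whose input contains the case input.
    matches = spans_df[
        spans_df["attributes.input.value"]
        .astype(str)
        .str.contains(str(case.inputs), regex=False, na=False)
    ]
    if matches.empty or not case.scores:
        continue
    span_id = matches.index[0]  # spans_df is indexed by span ID
    score = next(iter(case.scores.values())).value  # first evaluator's score
    eval_rows.append({"span_id": span_id, "score": float(score)})

eval_df = pd.DataFrame(eval_rows).set_index("span_id")

# Log the scores back to Phoenix so they appear alongside the traces.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Pydantic Evals", dataframe=eval_df)
)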

Custom Evaluation Workflows

You can create more complex evaluation workflows by combining multiple evaluators:
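For example, you can run a deterministic containment check, a strict exact-match assertion, and an LLM judge side by side on the same dataset. EqualsExpected is one of Pydantic Evals' built-in evaluators; the rest reuse the pieces defined above.

from pydantic_evals.evaluators import EqualsExpected, LLMJudge

combined_dataset = Dataset(
    cases=cases,
    evaluators=[
        ContainsExpected(),  # custom containment check from step 4
        EqualsExpected(),    # built-in exact-match assertion
        LLMJudge(rubric="The answer is factually correct."),
    ],
)

combined_report = combined_dataset.evaluate_sync(lookup_traced_output)
combined_report.print()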

Observe

Once you have evaluation results uploaded to Phoenix, you can:

  • View evaluation metrics: See overall performance across different evaluation criteria

  • Analyze individual cases: Drill down into specific examples that passed or failed

  • Compare evaluators: Understand how different evaluation methods perform

  • Track improvements: Monitor evaluation scores over time as you improve your application

  • Debug failures: Identify patterns in failed evaluations to guide improvements

The Phoenix UI will display your evaluation results with detailed breakdowns, making it easy to understand your AI application's performance and identify areas for improvement.
