OpenAI Agents Cookbook

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

  • Create an agent using the OpenAI agents SDK

  • Trace the agent activity

  • Create a dataset to benchmark performance

  • Run an experiment to evaluate agent performance using LLM as a judge

Initial setup

Install Libraries

!pip install -q arize-otel openinference-instrumentation-openai-agents openinference-instrumentation-openai arize-phoenix-evals "arize[Datasets]"

!pip install -q openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio openai-agents

Setup Keys

Copy the Arize API_KEY and SPACE_ID from your Space Settings page and set them in the cell below.

import os
import nest_asyncio
from getpass import getpass

nest_asyncio.apply()

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Setup Tracing

from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

# Setup OpenTelemetry via our convenience function
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name="agents-cookbook",
)

# Start instrumentation
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
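
If you also want to trace your own steps alongside the auto-instrumented OpenAI and agent calls, you can create manual spans from the same tracer provider. The snippet below is an optional sketch using the standard OpenTelemetry API; the tracer and span names are illustrative.

# Optional: manual spans for custom steps, using the provider registered above
tracer = tracer_provider.get_tracer("agents-cookbook")

with tracer.start_as_current_span("my-custom-step"):
    pass  # your own pre- or post-processing would go here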

Create your first agent with the OpenAI Agents SDK

Here we set up a basic agent that can solve math problems.

We have a function tool that can solve math equations, and an agent that can use this tool.

We'll use the Runner class to run the agent and get the final output.

from agents import Agent, Runner, function_tool


@function_tool
def solve_equation(equation: str) -> str:
    """Use python to evaluate the math equation, instead of thinking about it yourself.

    Args:
       equation: string which to pass into eval() in python
    """
    return str(eval(equation))

agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)
result = await Runner.run(agent, "what is 15 + 28?")

# Run Result object
print(result)

# Get the final output
print(result.final_output)

# Get the entire list of messages recorded to generate the final output
print(result.to_input_list())

Now that we have a basic agent, let's evaluate whether it responded correctly!

Evaluating our agent

Agents can go awry for a variety of reasons:

  1. Tool call accuracy - did our agent choose the right tool with the right arguments?

  2. Tool call results - did the tool respond with the right results?

  3. Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?

We'll set up a simple evaluator that checks whether the agent's response is correct; you can read about the different types of agent evals here.

Let's set up our evaluation by defining our task function, our evaluator, and our dataset.

import asyncio
from agents import Runner


# This is our task function. It takes a question and returns the final output and the messages recorded to generate the final output.
async def solve_math_problem(dataset_row: dict):
    result = await Runner.run(agent, dataset_row.get("question"))
    # OPTIONAL: You don't need to return the messages unless you want to use them in your eval
    return {
        "final_output": result.final_output,
        "messages": result.to_input_list(),
    }


dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)
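
As a quick check on the first failure mode above (tool call accuracy), you can scan the recorded messages for the expected tool call. This is a heuristic sketch rather than one of the evaluators used below; it assumes the Responses-style items returned by to_input_list() expose "type" and "name" fields, which may vary by SDK version.

# Heuristic check: did the agent call the expected tool at least once?
# Adjust the field names if your SDK version records tool calls differently.
def used_expected_tool(messages: list, tool_name: str = "solve_equation") -> bool:
    return any(
        isinstance(item, dict)
        and item.get("type") == "function_call"
        and item.get("name") == tool_name
        for item in messages
    )


print(used_expected_tool(result["messages"]))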

Let's create our evaluator.

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from arize.experimental.datasets.experiments.types import EvaluationResult


def correctness_eval(dataset_row: dict, output: dict) -> EvaluationResult:
    # Create a dataframe with the question and answer
    df_in = pd.DataFrame(
        {"question": [dataset_row.get("question")], "response": [output]}
    )

    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.
    
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]
    
    Assess if the answer to the math problem is correct. First work out the correct answer yourself,
    then compare with the provided response. Consider that there may be different ways to express the same answer 
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").
    
    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=df_in,
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=rails,
        provide_explanation=True,
    )

    # Extract results
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = eval_df["explanation"][0]

    # Return the evaluation result
    return EvaluationResult(score=score, label=label, explanation=explanation)
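
Before wiring the evaluator into an experiment, you can sanity-check it against the earlier run. This makes a real LLM call and assumes dataset_row and result from the previous cells are still in scope, and that EvaluationResult exposes the fields it was constructed with.

# Quick local check of the evaluator on the earlier question and answer
evaluation = correctness_eval(dataset_row, result)
print(evaluation.label, evaluation.score)
print(evaluation.explanation)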

Create a synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our math problem-solving agent.

MATH_GEN_TEMPLATE = """
You are an assistant that generates diverse math problems for testing a math solver agent.
The problems should include:

Basic Operations: Simple addition, subtraction, multiplication, division problems.
Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.
Exponents and Roots: Problems involving powers, square roots, and other nth roots.
Percentages: Problems involving calculating percentages of numbers or finding percentage changes.
Fractions: Problems with addition, subtraction, multiplication, or division of fractions.
Algebra: Simple algebraic expressions that can be evaluated with specific values.
Sequences: Finding sums, products, or averages of number sequences.
Word Problems: Converting word problems into mathematical equations.

Do not include any solutions in your generated problems.

Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.
Generate 25 diverse math problems. Ensure there are no duplicate problems.
"""
import nest_asyncio

nest_asyncio.apply()
pd.set_option("display.max_colwidth", 500)

# Initialize the model
model = OpenAIModel(model="gpt-4o", max_tokens=1300)

# Generate math problems
resp = model(MATH_GEN_TEMPLATE)

# Create DataFrame
split_response = resp.strip().split("\n")
math_problems_df = pd.DataFrame(split_response, columns=["question"])
print(math_problems_df.head())
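
The prompt asks for 25 unique problems, but LLM output can be noisy, so an optional cleanup step (sketched below) drops blank lines and duplicates before uploading.

# Optional cleanup: drop blank lines and duplicate questions
math_problems_df["question"] = math_problems_df["question"].str.strip()
math_problems_df = math_problems_df[math_problems_df["question"] != ""]
math_problems_df = math_problems_df.drop_duplicates().reset_index(drop=True)
print(f"{len(math_problems_df)} problems after cleanup")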

Now let's run the agent against this dataset!

Create an experiment

With the dataset of questions we generated above, we can use Arize's experiments feature to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it to the platform.

from arize.experimental.datasets import ArizeDatasetsClient
from uuid import uuid1
from arize.experimental.datasets.utils.constants import GENERATIVE

# Set up the arize client
arize_client = ArizeDatasetsClient(api_key=API_KEY)

dataset_name = "math-questions-" + str(uuid1())[:5]

dataset_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=dataset_name,
    dataset_type=GENERATIVE,
    data=math_problems_df,
)
dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)
print(dataset)
experiment_id, experiment_dataframe = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=solve_math_problem,
    evaluators=[correctness_eval],
    experiment_name=f"solve-math-questions-{str(uuid1())[:5]}",
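    # dry_run=True runs the task and evaluators locally without logging results to Arize;
    # set it to False (or remove it) to record the experiment in the platform.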
    dry_run=True,
)
experiment_dataframe
