We provide LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with a Phoenix model wrapper:
from phoenix.experiments.evaluators import HelpfulnessEvaluator
from phoenix.evals.models import OpenAIModel
helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())
Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether a given output contains a link, which can be implemented as a RegEx match.
phoenix.experiments.evaluators contains some pre-built code evaluators that can be passed to the evaluators parameter in experiments.
from phoenix.experiments import run_experiment, MatchesRegex
# This defines a code evaluator for links
contains_link = MatchesRegex(
    pattern=r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)",
    name="contains_link"
)
The above contains_link evaluator can then be passed as an evaluator to any experiment you'd like to run.
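For instance, a minimal sketch of passing it to an experiment (dataset and task here stand in for the dataset and task you define elsewhere, as shown later in this guide):
from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task, evaluators=[contains_link])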
For a full list of code evaluators, please consult the repo or API documentation.
The simplest way to create an evaluator is just to write a Python function. By default, a function of one argument will be passed the output of an experiment run. These custom evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.
Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range:
def in_bounds(x):
    return 1 <= x <= 100
By simply passing the in_bounds function to run_experiment, we will automatically generate evaluations for each experiment run for whether or not the output is in the allowed range.
More complex evaluations can use additional information. These values can be accessed by defining a function with specific parameter names which are bound to special values:
input - experiment run input - def eval(input): ...
output - experiment run output - def eval(output): ...
expected - example output - def eval(expected): ...
reference - alias for expected - def eval(reference): ...
metadata - experiment metadata - def eval(metadata): ...
These parameters can be used in any combination and any order to write custom complex evaluators!
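For instance, a hypothetical evaluator (the logic here is illustrative, just to show two bound parameters used together) can combine input and output:
def echoes_input(input, output) -> bool:
    # Hypothetical check: passes only if the output does not simply repeat the input verbatim
    return str(input) not in str(output)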
Below is an example of using the editdistance library to calculate how close the output is to the expected value:
pip install editdistance
import json
import editdistance

def edit_distance(output, expected) -> int:
    # Compare canonical JSON serializations of the output and the expected value
    return editdistance.eval(
        json.dumps(output, sort_keys=True), json.dumps(expected, sort_keys=True)
    )
For even more customization, use the create_evaluator decorator to further customize how your evaluations show up in the Experiments UI.
from phoenix.experiments.evaluators import create_evaluator
# the decorator can be used to set display properties
# `name` corresponds to the metric name shown in the UI
# `kind` indicates if the eval was made with a "CODE" or "LLM" evaluator
@create_evaluator(name="shorter?", kind="CODE")
def wordiness_evaluator(expected, output):
    reference_length = len(expected.split())
    output_length = len(output.split())
    return output_length < reference_length
The decorated wordiness_evaluator can be passed directly into run_experiment!
Phoenix supports running multiple evals on a single experiment, allowing you to comprehensively assess your model's performance from different angles. When you provide multiple evaluators, Phoenix creates evaluation runs for every combination of experiment runs and evaluators.
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsKeyword, MatchesRegex
experiment = run_experiment(
    dataset,
    task,
    evaluators=[
        ContainsKeyword("hello"),
        MatchesRegex(r"\d+"),
        custom_evaluator_function
    ]
)
The following are the key steps of running an experiment, illustrated by a simple example.
Make sure you have Phoenix and the instrumentors needed for the experiment installed. For this example we will use the OpenAI instrumentor to trace the LLM calls.
pip install arize-phoenix openinference-instrumentation-openai openai
The key steps of running an experiment are:
1. Define/upload a Dataset (e.g. a dataframe). Each record of the dataset is called an Example.
2. Define a task. A task is a function that takes each Example and returns an output.
3. Define Evaluators. An Evaluator is a function that evaluates the output for each Example.
4. Run the experiment.
We'll start by launching the Phoenix app.
import phoenix as px
px.launch_app()
A dataset can be as simple as a list of strings inside a dataframe. More sophisticated datasets can also be extracted from traces based on actual production data. Here we just have a small list of questions that we want to ask an LLM about NBA games:
Create pandas dataframe
import pandas as pd
df = pd.DataFrame(
    {
        "question": [
            "Which team won the most games?",
            "Which team won the most games in 2015?",
            "Who led the league in 3 point shots?",
        ]
    }
)
The dataframe can be sent to Phoenix via the Client. input_keys and output_keys are column names of the dataframe, representing the input/output to the task in question. Here we have just questions, so we leave the outputs blank:
Upload dataset to Phoenix
import phoenix as px
dataset = px.Client().upload_dataset(
    dataframe=df,
    input_keys=["question"],
    output_keys=[],
    dataset_name="nba-questions",
)
Each row of the dataset is called an Example.
A task is any function/process that returns a JSON serializable output. The task can also be an async function, but we use a sync function here for simplicity. If the task is a function of one argument, then that argument will be bound to the input field of the dataset example.
def task(x):
    return ...
For our example here, we'll ask an LLM to build SQL queries based on our question, which we'll run on a database and obtain a set of results:
Set Up Database
import duckdb
from datasets import load_dataset
data = load_dataset("suzyanil/nba-data")["train"]
conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())
Set Up Prompt and LLM
from textwrap import dedent
import openai
client = openai.Client()
columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")
LLM_MODEL = "gpt-4o"
columns_str = ",".join(column["column_name"] + ": " + column["column_type"] for column in columns)
system_prompt = dedent(f"""
You are a SQL expert, and you are given a single table named nba with the following columns:
{columns_str}\n
Write a SQL query corresponding to the user's
request. Return just the query text, with no formatting (backticks, markdown, etc.).""")
def generate_query(question):
    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")
def text2sql(question):
    results = error = None
    try:
        query = generate_query(question)
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)
    return {"query": query, "results": results, "error": error}
Define task as a Function
Recall that each row of the dataset is encapsulated as an Example object, and that the input keys were defined when we uploaded the dataset:
def task(x):
    return text2sql(x["question"])
More complex task inputs
More complex tasks can use additional information. These values can be accessed by defining a task function with specific parameter names which are bound to special values associated with the dataset example:
input - example input - def task(input): ...
expected - example output - def task(expected): ...
reference - alias for expected - def task(reference): ...
metadata - example metadata - def task(metadata): ...
example - Example object - def task(example): ...
A task can be defined as a sync or async function that takes any number of the above argument names in any order!
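For example, a sketch of an async task that uses both the example input and its metadata (the field names question and hint are illustrative and not part of the dataset above):
async def task_with_hint(input, metadata):
    # Hypothetical: combine the example's question with an optional hint stored in metadata
    question = input["question"]
    hint = metadata.get("hint", "")
    return f"{question} {hint}".strip()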
An evaluator is any function that takes the task output and returns an assessment. Here we'll simply check if the queries succeeded in obtaining any result from the database:
def no_error(output) -> bool:
    return not bool(output.get("error"))

def has_results(output) -> bool:
    return bool(output.get("results"))
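As a quick local check, you can call them on hand-constructed outputs shaped like the dict returned by text2sql above:
no_error({"query": "SELECT 1", "results": [{"1": 1}], "error": None})   # True
has_results({"query": "SELECT 1", "results": [], "error": "Parser Error"})  # False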
Instrument OpenAI
Instrumenting the LLM will also give us the spans and traces that will be linked to the experiment and can be examined in the Phoenix UI:
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register
tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Run the Task and Evaluators
Running an experiment is as easy as calling run_experiment with the components we defined above. The results of the experiment will show up in Phoenix:
from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task=task, evaluators=[no_error, has_results])
Additional evaluations can also be attached to an experiment that has already run by calling evaluate_experiment:
from phoenix.experiments import evaluate_experiment
evaluators = [
    # add evaluators here
]
experiment = evaluate_experiment(experiment, evaluators)
If you no longer have access to the original experiment object, you can retrieve it from Phoenix using the get_experiment client method.
from phoenix.experiments import evaluate_experiment
import phoenix as px
experiment_id = "experiment-id" # set your experiment ID here
experiment = px.Client().get_experiment(experiment_id=experiment_id)
evaluators = [
    # add evaluators here
]
experiment = evaluate_experiment(experiment, evaluators)
Sometimes we may want to do a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. run_experiment() and evaluate_experiment() are both equipped with a dry_run= parameter for this purpose: it executes the task and evaluators on a small subset without sending data to the Phoenix server. Setting dry_run=True selects one sample from the dataset, and setting it to a number, e.g. dry_run=3, selects multiple. The sampling is also deterministic, so you can keep re-running it for debugging purposes.
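For example, to exercise the task and evaluators from this guide on three deterministic samples without sending anything to Phoenix:
from phoenix.experiments import run_experiment
experiment = run_experiment(dataset, task=task, evaluators=[no_error, has_results], dry_run=3)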