Evaluate experiment with code
How to write the functions to evaluate your task outputs in experiments
Here's the simplest version of an evaluation function:
def is_true(output):
    # output is the task output
    return output == True
You can define a simple function to read the output of a task and check it.
Evaluation Inputs
The evaluator function can take the following optional arguments:
dataset_row: the entire dataset row, with every column available as a dictionary key
def eval(dataset_row): ...
input: the experiment run input, which is mapped to attributes.input.value
def eval(input): ...
output: the experiment run output
def eval(output): ...
dataset_output: the expected output, if available, mapped to attributes.output.value
def eval(dataset_output): ...
metadata: the dataset_row metadata, which is mapped to attributes.metadata
def eval(metadata): ...
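You can mix and match these arguments by naming only the parameters you need. Here's a minimal sketch; the "keyword" column is a hypothetical dataset column, not part of the API:

def exact_match(output, dataset_output):
    # Compare the task output to the expected output from the dataset
    return output == dataset_output

def contains_keyword(output, dataset_row):
    # dataset_row exposes every dataset column; "keyword" is a hypothetical column name
    keyword = dataset_row.get("keyword", "")
    return keyword in str(output)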
Evaluation Outputs
We support several types of evaluation outputs. A label must be a string, a score must range from 0.0 to 1.0, and an explanation must be a string. Each return type maps to these fields as follows:

boolean: True → label = 'True', score = 1.0
float: 1.0 → score = 1.0
string: "reasonable" → label = 'reasonable'
tuple: (1.0, "my explanation notes") → score = 1.0, explanation = 'my explanation notes'
tuple: ("True", 1.0, "my explanation") → label = 'True', score = 1.0, explanation = 'my explanation'
EvaluationResult: EvaluationResult(score=1.0, label='reasonable', explanation='explanation', metadata={}) → score = 1.0, label = 'reasonable', explanation = 'explanation', metadata = {}
To use the EvaluationResult class, use the following import statement:
from arize.experimental.datasets.experiments.types import EvaluationResult
One of label or score must be supplied (you can't have an evaluation with no result).
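For example, here's a sketch of an evaluator that returns a full EvaluationResult; the length threshold, label names, and explanation text are illustrative:

from arize.experimental.datasets.experiments.types import EvaluationResult

def concise_eval(output):
    # Hypothetical check: reward outputs under 200 characters
    text = str(output)
    score = 1.0 if len(text) <= 200 else 0.0
    label = "concise" if score == 1.0 else "verbose"
    return EvaluationResult(
        score=score,
        label=label,
        explanation=f"Output length was {len(text)} characters",
        metadata={},
    )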
Here's another example which compares the output to a value in the dataset_row.
"""Example dataset
dataframe = pd.DataFrame({
"expected": [2]
})
"""
def is_equal(dataset_row, output):
expected = dataset_row.get("expected")
return expected == output
To run the experiment, pass the evaluator to run_experiment as follows:
arize_client.run_experiment(
    space_id="",
    dataset_name="",
    task="",
    evaluators=[is_equal],
    experiment_name=""
)
Create an LLM Evaluator
LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.
Arize supports a large number of LLM evaluators out of the box with llm_classify; see Arize Templates for the full list.
Here's an example of an LLM evaluator that checks for hallucinations in the model output:
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments.types import EvaluationResult
HALLUCINATION_PROMPT_TEMPLATE = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text.
Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.
# Query: {query}
# Reference text: {reference}
# Answer: {response}
Is the answer above factual or hallucinated based on the query and reference text?
"""
def hallucination_eval(output, dataset_row):
    # Get the original query and reference text from the dataset_row
    query = dataset_row.get("query")
    reference = dataset_row.get("reference")
    # Create a DataFrame to pass into llm_classify
    df_in = pd.DataFrame(
        {"query": query, "reference": reference, "response": output}, index=[0]
    )
    # Run the LLM classification
    eval_df = llm_classify(
        dataframe=df_in,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=["factual", "hallucinated"],
        provide_explanation=True,
    )
    # Map the eval DataFrame to an EvaluationResult
    label = eval_df["label"][0]
    score = 1 if label == "factual" else 0
    explanation = eval_df["explanation"][0]
    # Return the evaluation result
    return EvaluationResult(label=label, score=score, explanation=explanation)
In this example, the hallucination_eval function evaluates whether the output of an experiment contains hallucinations, using an LLM to judge the output against the query and reference text from the dataset row. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
Once you define your evaluator function, you can use it in your experiment run like this:
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[hallucination_eval],
    experiment_name=experiment_name,
)
You can customize LLM evaluators to suit your experiment's needs: update the template with your own instructions and the rails with the labels you want the LLM to output.
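For instance, here's a sketch of a custom tone evaluator. The template, rails, and tone_eval name are illustrative rather than a prebuilt Arize template, and OPENAI_API_KEY is assumed to be set as in the example above:

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments.types import EvaluationResult

TONE_PROMPT_TEMPLATE = """
You are given a response sent to a user. Classify its tone.
Your response should be a single word: either "polite" or "rude".
# Response: {response}
"""

def tone_eval(output):
    # Classify the tone of the task output with an LLM judge
    df_in = pd.DataFrame({"response": output}, index=[0])
    eval_df = llm_classify(
        dataframe=df_in,
        template=TONE_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=["polite", "rude"],  # rails match the words the template asks for
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    return EvaluationResult(
        label=label,
        score=1.0 if label == "polite" else 0.0,
        explanation=eval_df["explanation"][0],
    )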
Create a code evaluator
Code Evaluators
Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.
Custom Code Evaluators
Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.
For example, let’s say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range:
def in_bounds(output):
    return 1 <= output <= 100
When you pass the in_bounds function to run_experiment, an evaluation is automatically generated for each experiment run, indicating whether the output falls within the allowed range. This lets you quickly assess the validity of your experiment's outputs against custom criteria.
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[in_bounds],
    experiment_name=experiment_name,
)
Prebuilt Phoenix Code Evaluators
Alternatively, you can leverage one of our prebuilt evaluators from phoenix.experiments.evaluators.
This evaluator checks whether the output of an experiment run is a JSON-parsable string. It's useful when you want to ensure that the generated output can be correctly formatted and parsed as JSON, which is critical for applications that rely on structured data formats.
from phoenix.experiments.evaluators import JSONParsable

# This defines a code evaluator that checks if the output is JSON-parsable
json_parsable_evaluator = JSONParsable()
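A prebuilt evaluator is passed to run_experiment the same way as a custom function; here's a sketch following the call pattern used above:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[json_parsable_evaluator],
    experiment_name=experiment_name,
)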
Advanced: Evaluator as a Class
Users have the option to run an experiment by creating an evaluator that inherits from the Evaluator(ABC) base class in the Arize Python SDK. The evaluator's evaluate method receives the experiment run data, such as the dataset row, and returns an EvaluationResult dataclass.
This is an alternative you can use if you'd prefer object-oriented programming over functional programming.
Eval Class Inputs
The evaluate method supports the following arguments:
input: the experiment run input
def eval(input): ...
output: the experiment run output
def eval(output): ...
dataset_row: the entire dataset row, with every column available as a dictionary key
def eval(dataset_row): ...
metadata: the dataset_row metadata
def eval(metadata): ...
class ExampleAll(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")

class ExampleDatasetrow(Evaluator):
    def evaluate(self, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluator Using dataset_row")

class ExampleInput(Evaluator):
    def evaluate(self, input, **kwargs) -> EvaluationResult:
        print("Evaluator Using Input")

class ExampleOutput(Evaluator):
    def evaluate(self, output, **kwargs) -> EvaluationResult:
        print("Evaluator Using Output")
EvaluationResult Outputs
The evaluate method can return a score, a label, a tuple of (score, label, explanation), or an EvaluationResult instance:

EvaluationResult: score, label, and explanation
float: score output
string: label string output
class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")
        # score, label, and explanation are assumed to be computed above
        return EvaluationResult(score=score, label=label, explanation=explanation)

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using A float")
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using A Label")
        return "good"
Code Evaluator as Class
from arize.experimental.datasets.experiments.evaluators.base import Evaluator, EvaluationResult

class MatchesExpected(Evaluator):
    annotator_kind = "CODE"
    name = "matches_expected"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row.get("expected")
        matches = expected_output == output
        score = float(matches)
        # label must be a string, so convert the boolean result
        label = str(matches)
        return EvaluationResult(score=score, label=label)

    async def async_evaluate(self, *args, **kwargs) -> EvaluationResult:
        return self.evaluate(*args, **kwargs)
You can run this class using the following:
arize_client.run_experiment(
    space_id="",
    dataset_name="",
    task="",
    evaluators=[MatchesExpected()],
    experiment_name=""
)
LLM Evaluator as Class Example
Here's an example of an LLM evaluator that checks for hallucinations in the model output. The Phoenix Evals package is designed for running evaluations in code:
import pandas as pd

from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.types import EvaluationResult
class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]
        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )
        # Run the LLM classification
        rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
            rails=rails,
            provide_explanation=True,
        )
        label = expect_df["label"][0]
        score = 1 if label == "factual" else 0  # score 1 when the answer is factual
        explanation = expect_df["explanation"][0]
        # Return the evaluation result
        return EvaluationResult(score=score, label=label, explanation=explanation)
In this example, the HallucinationEvaluator
class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify
function runs the eval, and the evaluator returns an EvaluationResult
that includes a score, label, and explanation.
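As with the code evaluator class, pass an instance of the class to run_experiment; a sketch following the earlier call pattern:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[HallucinationEvaluator()],
    experiment_name=experiment_name,
)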
Advanced: Multiple Evaluators on Experiment Runs
Arize supports running multiple evals on a single experiment, allowing you to comprehensively assess your model's performance from different angles. When you provide multiple evaluators, Arize creates evaluation runs for every combination of experiment runs and evaluators.
from arize.experiments import run_experiment
from arize.experiments.evaluators import ContainsKeyword, MatchesRegex

experiment = run_experiment(
    dataset,
    task,
    evaluators=[
        ContainsKeyword("hello"),
        MatchesRegex(r"\d+"),
        custom_evaluator_function,  # any evaluator function you've defined
    ],
)