Advanced options for running evals on experiments via code
Advanced: Evaluator as a Class
You can run an experiment by creating an evaluator that inherits from the Evaluator(ABC) base class in the Arize Python SDK. The evaluator takes a single dataset row as input and returns an EvaluationResult dataclass.
This is an alternative to defining evaluators as plain functions, for users who prefer object-oriented programming over functional programming.
Eval Class Inputs
The evaluate method supports the following arguments:

| Argument | Description | Example |
| --- | --- | --- |
| `input` | experiment run input | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `dataset_row` | the entire row of the data, with every dataset column as a dictionary key | `def eval(dataset_row): ...` |
| `metadata` | experiment metadata | `def eval(metadata): ...` |
```python
class ExampleAll(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")

class ExampleDatasetRow(Evaluator):
    def evaluate(self, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluator Using dataset_row")

class ExampleInput(Evaluator):
    def evaluate(self, input, **kwargs) -> EvaluationResult:
        print("Evaluator Using Input")

class ExampleOutput(Evaluator):
    def evaluate(self, output, **kwargs) -> EvaluationResult:
        print("Evaluator Using Output")
```
EvaluationResult Outputs
An evaluator can return a score, a label, a tuple of (score, label, explanation), or an EvaluationResult object:

| Return type | Description |
| --- | --- |
| `EvaluationResult` | Score, label, and explanation |
| `float` | Score output |
| `str` | Label string output |
```python
class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")
        # score, label, and explanation are produced by your evaluation logic
        return EvaluationResult(score=score, label=label, explanation=explanation)

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Returning A Float")
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Returning A Label")
        return "good"
```
Code Evaluator as Class
```python
from arize.experimental.datasets.experiments.evaluators.base import Evaluator, EvaluationResult

class MatchesExpected(Evaluator):
    annotator_kind = "CODE"
    name = "matches_expected"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row.get("expected")
        matches = expected_output == output
        score = float(matches)
        return EvaluationResult(score=score, label=str(matches))

    async def async_evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output, dataset_row, **kwargs)
```
You can run this class using the following:

```python
arize_client.run_experiment(
    space_id="",          # your Arize space ID
    dataset_name="",      # the dataset to run the experiment on
    task=task,            # your task function (see the sketch below)
    evaluators=[MatchesExpected()],
    experiment_name="",   # a name for this experiment
)
```
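For reference, the task argument is a callable that produces an output for each dataset row. A minimal sketch, assuming the task accepts the dataset row in the same way evaluators do; the function name and column name are hypothetical:

```python
def task(dataset_row) -> str:
    # Hypothetical task: read a "question" column from the dataset row
    # and return the output that the evaluators will score.
    question = dataset_row.get("question", "")
    return f"Answer to: {question}"
```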
LLM Evaluator as Class Example
Here's an example of an LLM evaluator that checks for hallucinations in the model output. It uses the Phoenix Evals package, which is designed for running evaluations in code:
```python
import os

import pandas as pd

from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.types import EvaluationResult

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]
        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )
        # Run the LLM classification
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
            rails=rails,
            provide_explanation=True,
        )
        label = expect_df["label"][0]
        score = 1 if label == "hallucinated" else 0  # Score 1 if the output is hallucinated (i.e., incorrect)
        explanation = expect_df["explanation"][0]
        # Return the evaluation result
        return EvaluationResult(score=score, label=label, explanation=explanation)
```
In this example, the HallucinationEvaluator class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
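You can attach this evaluator to an experiment the same way as the code evaluator above:

```python
arize_client.run_experiment(
    space_id="",
    dataset_name="",
    task=task,  # your task function
    evaluators=[HallucinationEvaluator()],
    experiment_name="",
)
```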
Advanced: Multiple Evaluators on Experiment Runs
Arize supports running multiple evals on a single experiment, allowing you to comprehensively assess your model's performance from different angles. When you provide multiple evaluators, Arize creates evaluation runs for every combination of experiment runs and evaluators.
```python
from arize.experiments import run_experiment
from arize.experiments.evaluators import ContainsKeyword, MatchesRegex

experiment = run_experiment(
    dataset,
    task,
    evaluators=[
        ContainsKeyword("hello"),
        MatchesRegex(r"\d+"),
        custom_evaluator_function,  # any evaluator function you define (see below)
    ],
)
```
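The custom_evaluator_function above can be any plain function that follows the evaluator argument conventions described earlier. A minimal sketch; the function name, column name, and scoring logic are illustrative only:

```python
def custom_evaluator_function(output, dataset_row):
    # Hypothetical check: score 1.0 when the output contains the expected answer.
    expected = str(dataset_row.get("expected", ""))
    return 1.0 if expected and expected in str(output) else 0.0
```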