Log Evaluations to Arize
Evaluations are essential to understanding how well your model performs in real-world scenarios, helping you identify strengths, weaknesses, and areas for improvement.
Offline evaluations are run as code and then sent back to Arize using log_evaluations_sync.
This guide assumes you have traces in Arize and are looking to run an evaluation to measure your application performance.
To add evaluations, you can run them automatically as a task in the UI, or follow the steps below to generate evaluations in code and log them to Arize:
Import your spans in code
Once you have traces in Arize, visit the LLM Tracing tab to view your traces and export them as code. Clicking the export button gives you boilerplate code that you can copy and paste into your evaluator.
Example exported boilerplate:
# this will be prefilled by the export command.
# Note: This uses a different API Key than the one above.
ARIZE_API_KEY = ''
# import statements required for getting your spans
import os
os.environ['ARIZE_API_KEY'] = ARIZE_API_KEY
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
# Exporting your dataset into a dataframe
client = ArizeExportClient()
primary_df = client.export_model_to_df(
    space_id='',  # this will be prefilled by export
    model_id='',  # this will be prefilled by export
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat(''),  # this will be prefilled by export
    end_time=datetime.fromisoformat(''),  # this will be prefilled by export
)

Run a custom evaluator using Phoenix Evals
The Phoenix Evals package is designed for running evaluations in code. It provides:
Code-based evaluations - Run evaluations programmatically in Python scripts, notebooks, or CI/CD pipelines.
Eval Models - Phoenix Evals lets you configure which foundation model you'd like to use as a judge.
Speed - Evals run in batches and are typically 10x faster than calling the APIs directly.
Built-in explanations - All Phoenix evaluations include an explanation flag.
Example usage:
import os
from phoenix.evals import OpenAIModel, llm_classify

Ensure you have your OpenAI API key set up correctly for your OpenAI model.

api_key = os.environ.get("OPENAI_API_KEY")
eval_model = OpenAIModel(
    model="gpt-4o", temperature=0, api_key=api_key
)

Create a prompt template for the LLM to judge the quality of your responses. You can use any of the Arize Evaluator Templates or create your own. Example template that judges positivity/negativity:
MY_CUSTOM_TEMPLATE = '''
You are evaluating the positivity or negativity of the responses to questions.
[BEGIN DATA]
************
[Question]: {input}
************
[Response]: {output}
[END DATA]
Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative".
'''

Check which attributes are present in your traces dataframe:

primary_df.columns

If you're using OpenAI traces, set the input/output variables used by the template like this:
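A minimal sketch, assuming your exported spans follow the OpenInference column naming, where the span input and output are stored in attributes.input.value and attributes.output.value (check the output of primary_df.columns above and adjust the column names if yours differ):

# Map the exported span columns onto the {input} and {output} template variables
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]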
Run the evaluation using llm_classify:
evals_df = llm_classify(
    dataframe=primary_df,
    template=MY_CUSTOM_TEMPLATE,
    model=eval_model,
    rails=["positive", "negative"],
    provide_explanation=True,  # also return an explanation column for each eval
)

Log evaluations back to Arize
Use the log_evaluations_sync function from the Python SDK to attach evaluations you've run to traces. The code below assumes you have the evals_dataframe from your evaluation run and a traces_dataframe to get the span_id needed to attach the evals.
The evals_dataframe requires four columns (the <eval_name> must be alphanumeric and cannot have hyphens or spaces):
eval.<eval_name>.label
eval.<eval_name>.score
eval.<eval_name>.explanation
context.span_id
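As an illustrative sketch (not part of the original export), assuming the llm_classify call above was run with provide_explanation=True so that evals_df contains label and explanation columns, you could rename them into this format; the eval name positivity is arbitrary, and the numeric score here is simply derived from the label:

# Rename the llm_classify output columns to Arize's eval.<eval_name>.* format
evals_df = evals_df.rename(columns={
    "label": "eval.positivity.label",
    "explanation": "eval.positivity.explanation",
})
# Derive a numeric score from the label: 1.0 for "positive", 0.0 for "negative"
evals_df["eval.positivity.score"] = (evals_df["eval.positivity.label"] == "positive").astype(float)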
Example evaluation data:
import pandas as pd

evaluation_data = {
    'context.span_id': ['74bdfb83-a40e-4351-9f41-19349e272ae9'],  # Use your span_id
    'eval.myeval.label': ['accuracy'],  # Example label
    'eval.myeval.score': [0.95],  # Example score
    'eval.myeval.explanation': ["some explanation"]
}
evaluation_df = pd.DataFrame(evaluation_data)

Sample code to run your evaluation and log it in real time to Arize:
import os
from arize.pandas.logger import Client

API_KEY = os.environ.get("ARIZE_API_KEY")
SPACE_ID = os.environ.get("ARIZE_SPACE_ID")
DEVELOPER_KEY = os.environ.get("ARIZE_DEVELOPER_KEY")

# Initialize the Arize client with your space ID and API key
arize_client = Client(
    space_id=SPACE_ID,
    api_key=API_KEY,
)

# Index evals_df by span ID so each eval is attached to the correct span
evals_df = evals_df.set_index(primary_df["context.span_id"])

# Send the evals to Arize
arize_client.log_evaluations_sync(evals_df, 'YOUR_PROJECT_NAME')