Learn how to evaluate your LLM application’s performance using the Arize SDK. Step-by-step guide for running evaluations on traces and measuring quality.
This guide shows you how to measure the quality of your LLM application’s outputs using the Arize AX Python SDK. You’ll define evaluation criteria, run evaluations on your traces, and interpret the results.

Traces tell us what happened during a run, but they don’t tell us whether the output was good. Evaluations fill that gap by letting us score outputs in a consistent, repeatable way.

You’ll work with existing trace data in Arize AX, create an evaluation that defines quality criteria, and run it to score your outputs.
Evaluations are the bridge between “my application ran” and “my application is working well.” Without evaluations, you’re left manually inspecting outputs or relying on user feedback, neither of which scales.

The Python SDK makes evaluations powerful by:
Integrating with your traces — evaluate the same data you’re already collecting
Providing programmatic access — run evaluations where you prefer, not just in the UI
Supporting multiple evaluation types — LLM-as-a-judge, code-based, and human labels
Enabling experimentation — use evaluation results to guide improvements
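Of the evaluation types above, a code-based evaluator is the simplest to picture: it is just a deterministic function over an input/output pair. The `ticker_coverage` function below is a hypothetical illustration (not part of the Arize SDK) that checks whether every requested ticker appears in a report:

```python
# A hypothetical code-based evaluator: a deterministic check that every
# requested ticker actually appears in the generated report.
def ticker_coverage(tickers: str, report: str) -> dict:
    requested = [t.strip().upper() for t in tickers.split(",")]
    missing = [t for t in requested if t not in report.upper()]
    return {
        "label": "complete" if not missing else "incomplete",
        "score": 1.0 - len(missing) / len(requested),
        "explanation": f"Missing tickers: {missing}" if missing else "All tickers covered",
    }

result = ticker_coverage("AAPL, MSFT", "AAPL beat estimates; MSFT cloud revenue grew 20%.")
# result["label"] == "complete", result["score"] == 1.0
```

Checks like this are cheap and repeatable, but they can’t judge nuance (depth, relevance, tone), which is where LLM-as-a-judge evaluators come in.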
To follow along, you’ll need to have completed the Tracing tutorial, which means you already have:
A CrewAI financial analysis and research chatbot
Trace data in Arize AX
Follow along with code
This guide has a companion notebook with runnable code examples. Find it in this notebook.
Before we can run evaluations, we need something to evaluate.

Evaluations in Arize run over existing trace data. If you followed the tracing guide, you should already have traces containing LLM inputs and outputs.

Having multiple traces helps you see how quality varies across different runs. Generate more trace data by running your agent with various inputs if you need additional examples to evaluate.
```python
test_queries = [
    {"tickers": "AAPL", "focus": "financial analysis and market outlook"},
    {"tickers": "NVDA", "focus": "valuation metrics and growth prospects"},
    {"tickers": "AMZN", "focus": "profitability and market share"},
    {"tickers": "AAPL, MSFT", "focus": "comparative financial analysis"},
    {"tickers": "META, SNAP, PINS", "focus": "social media sector trends"},
    {"tickers": "RIVN", "focus": "financial health and viability"},
    {"tickers": "SNOW", "focus": "revenue growth trajectory"},
    {"tickers": "KO", "focus": "dividend yield and stability"},
    {"tickers": "META", "focus": "latest developments and stock performance"},
    {"tickers": "AAPL, MSFT, GOOGL, AMZN, META", "focus": "big tech comparison and market outlook"},
    {"tickers": "AMC", "focus": "financial analysis and market sentiment"},
]

for query in test_queries:
    crew.kickoff(inputs=query)
```
With the Python SDK, you can programmatically export your traces. This gives you full control over your data: you can analyze, filter, and evaluate your traces at scale.
```python
import os
from datetime import datetime, timedelta

from arize import ArizeClient

client = ArizeClient(api_key=os.getenv("ARIZE_API_KEY"))

# Export spans from the last hour (adjust the time range as needed)
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

df = client.spans.export_to_df(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    start_time=start_time,
    end_time=end_time,
)

# Keep only top-level agent spans; child spans are individual LLM/tool calls
parent_spans = df[df["attributes.openinference.span.kind"] == "CHAIN"]
```
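Before evaluating, it’s worth sanity-checking what the filter keeps. The sketch below uses a toy DataFrame with the same columns the export produces (column names follow OpenInference conventions); in practice you would run these lines against the real `df` returned by `export_to_df`:

```python
import pandas as pd

# Toy stand-in for the exported df, with the columns the real export uses
df = pd.DataFrame({
    "context.span_id": ["a1", "b2", "c3"],
    "attributes.openinference.span.kind": ["CHAIN", "LLM", "CHAIN"],
    "attributes.input.value": ["AAPL outlook", "internal prompt", "NVDA growth"],
    "attributes.output.value": ["report A", "completion", "report B"],
})

# Top-level agent runs are CHAIN spans; child LLM calls are filtered out
parent_spans = df[df["attributes.openinference.span.kind"] == "CHAIN"]
print(f"{len(parent_spans)} of {len(df)} spans are top-level agent runs")
```

If the count is zero, widen the time range or check that your tracing instrumentation is emitting CHAIN spans for the agent’s top-level runs.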
Now that we have trace data, the next question is how we decide whether an output is actually good.

An evaluation makes that decision explicit. Instead of manually inspecting outputs or relying on intuition, we define a rule that can be applied consistently across many runs.

In Arize, evaluations can be written in different ways. In this guide, we’ll use an LLM-as-a-Judge evaluation as a simple starting point. This works well for questions like correctness or relevance, and lets us get metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.)

For LLM-as-a-Judge evaluations, that means defining three things:
A prompt that describes the judgment criteria
An LLM that performs the evaluation
The data we want to score
In this step, we’ll define a basic completeness evaluation that checks whether the agent’s output completely answers the input.
Define the prompt that specifies how the evaluator should judge outputs. The prompt uses attributes.input.value and attributes.output.value to access the input and output data from your spans.
```python
financial_completeness_template = """You are evaluating whether a financial research report correctly completes ALL parts of the user's task with COMPREHENSIVE coverage.

User input: {attributes.input.value}

Generated report:
{attributes.output.value}

To be marked as "complete", the report MUST meet ALL of these strict requirements:

1. TICKER COVERAGE (MANDATORY):
   - Cover ALL companies/tickers mentioned in the input
   - If multiple tickers are listed, EACH must have dedicated analysis (not just mentioned in passing)
   - For multiple tickers, the report must provide COMPARATIVE analysis when relevant

2. FOCUS AREA COVERAGE (MANDATORY):
   - Address ALL focus areas mentioned in the input
   - If the focus mentions multiple topics (e.g., "earnings and outlook"), BOTH must be thoroughly addressed
   - Each focus area must have substantial content, not just a brief mention

3. FINANCIAL DATA REQUIREMENTS (MANDATORY):
   - For EACH ticker, the report must include:
     * Current/recent stock price or performance data
     * At least 2 key financial ratios (P/E, P/B, debt-to-equity, ROE, etc.)
     * Revenue or earnings information
     * Recent news or developments (within last 6 months)
   - If focus mentions specific metrics (e.g., "P/E ratio"), those MUST be explicitly provided

4. DEPTH REQUIREMENT (MANDATORY):
   - Each ticker must have at least 3-4 sentences of dedicated analysis
   - Generic statements without specific data do NOT count
   - The report must demonstrate thorough research, not superficial coverage

5. COMPARISON REQUIREMENT (if multiple tickers):
   - If 2+ tickers are requested, the report MUST include direct comparisons
   - Comparisons should cover multiple key metrics side-by-side
   - Generic statements like "both companies are good" do NOT satisfy this requirement
   - Must explicitly state which company performs better/worse on specific metrics

The report is "incomplete" if it fails ANY of the above requirements, including:
- Missing any ticker or only mentioning it briefly
- Failing to address any focus area or only addressing it superficially
- Missing required financial data for any ticker
- Providing generic analysis without specific metrics or data
- Failing to provide comparisons when multiple tickers are requested
- Not meeting the depth requirement for any ticker

Respond with ONLY one word: "complete" or "incomplete"

Then provide a detailed explanation of which specific requirements were met or failed."""
```
This prompt defines what completeness means for our application. By making the criteria explicit, we can apply it consistently and understand why outputs pass or fail.
We’ll use the Phoenix Evals library to run our evaluations by combining the prompt and model into an evaluator. Phoenix Evals provides reusable evaluation primitives (such as create_classifier and LLM) that make it easy to define and run evaluators. We can then use those evaluation results with the Arize AX SDK.

While this example shows how to create a custom LLM-as-a-Judge evaluator, there are other ways to create evaluators, including code evaluators for deterministic checks and pre-built LLM evaluator templates for common evaluation scenarios.
Now it’s time to evaluate your traces. We’ll apply the evaluator to the exported trace data.

We’re using parent_spans because these represent the top-level agent executions that contain the final outputs we want to evaluate. Child spans are individual LLM calls or tool invocations that are part of the overall workflow, but in this case, we want to score the complete agent output.

We’ll use evaluate_dataframe to run the evaluation across all our parent spans.
```python
from openinference.instrumentation import suppress_tracing
from phoenix.evals import evaluate_dataframe

# Suppress tracing so the evaluator's own LLM calls don't show up as new traces
with suppress_tracing():
    results_df = evaluate_dataframe(dataframe=parent_spans, evaluators=[completeness_evaluator])
```
This produces evaluation results for each span in the dataset. Each result includes a score, label, and explanation.
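For a quick read on those results before uploading, plain pandas is enough. The toy `results_df` below stands in for the real output of `evaluate_dataframe` (actual column names may vary by version); the summary lines are what you would run on the real frame:

```python
import pandas as pd

# Toy stand-in for the results_df returned by evaluate_dataframe
results_df = pd.DataFrame({
    "label": ["complete", "incomplete", "complete"],
    "score": [1.0, 0.0, 1.0],
    "explanation": ["all tickers covered", "missing P/E ratio", "all tickers covered"],
})

# Fraction of runs the judge marked "complete"
completion_rate = (results_df["label"] == "complete").mean()
print(f"Completion rate: {completion_rate:.0%}")  # 67% on this toy data
```

Reading a few `explanation` values for the `incomplete` rows is often the fastest way to spot which requirement your agent is failing.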
We’ll upload the results back to Arize AX, where they’ll appear alongside your traces. This connects quality scores to execution data, giving you a complete picture of your application’s performance in one place.

If you prefer, you can view the evaluation results programmatically: the evals_df DataFrame contains all evaluation results, including score, label, and explanation for each run of the evaluator.
```python
import pandas as pd
from phoenix.evals.utils import to_annotation_dataframe

# Convert Phoenix eval results to annotation format
annotation_df = to_annotation_dataframe(dataframe=results_df)

# Get span_ids from parent_spans
span_ids = parent_spans["context.span_id"].values

# Build the evaluation DataFrame in the column format Arize expects
eval_results = []
eval_name = "completeness"
for idx, span_id in enumerate(span_ids):
    if idx < len(annotation_df):
        eval_result = {
            "context.span_id": span_id,
            f"eval.{eval_name}.label": annotation_df["label"].iloc[idx],
            f"eval.{eval_name}.score": float(annotation_df["score"].iloc[idx]),
            f"eval.{eval_name}.explanation": annotation_df["explanation"].iloc[idx],
        }
        eval_results.append(eval_result)

evals_df = pd.DataFrame(eval_results)

# Upload evaluations to Arize AX
response = client.spans.update_evaluations(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    dataframe=evals_df,
    force_http=True,
)
```
With evaluation results logged to Arize AX, you’re ready to learn the workflows for continuous evaluation and improvement.

To go deeper on evaluations, the Evaluator Docs cover writing more nuanced evaluators, using different scoring strategies, and comparing quality across runs as your application evolves.