This guide shows you how to measure the quality of your LLM application’s outputs using the Arize AX Python SDK. You’ll define evaluation criteria, run evaluations on your traces, and interpret the results. Traces tell us what happened during a run, but they don’t tell us whether the output was good. Evaluations fill that gap by letting us score outputs in a consistent, repeatable way. You’ll work with existing trace data in Arize AX, create an evaluation that defines quality criteria, and run it to score your outputs.

Why Evaluate with the Arize AX Python SDK?

Evaluations are the bridge between “my application ran” and “my application is working well.” Without evaluations, you’re left manually inspecting outputs or relying on user feedback, neither of which scales. The Python SDK makes evaluations powerful by:
  • Integrating with your traces — evaluate the same data you’re already collecting
  • Providing programmatic access — run evaluations where you prefer, not just in the UI
  • Supporting multiple evaluation types — LLM-as-a-judge, code-based, and human labels
  • Enabling experimentation — use evaluation results to guide improvements
To follow along, you’ll need to have completed the Tracing tutorial, which means you already have:
  • Crew AI Financial Analysis and Research Chatbot
  • Trace Data in Arize

Follow along with code

This guide has a companion notebook with runnable code examples. Find it in this notebook.

Step 1: Make Sure You Have Data in Arize

Before we can run evaluations, we need something to evaluate. Evaluations in Arize run over existing trace data. If you followed the tracing guide, you should already have traces containing LLM inputs and outputs. Having multiple traces helps you see how quality varies across different runs. If you need additional examples to evaluate, generate more trace data by running your agent with various inputs:
test_queries = [
    {"tickers": "AAPL", "focus": "financial analysis and market outlook"},
    {"tickers": "NVDA", "focus": "valuation metrics and growth prospects"},
    {"tickers": "AMZN", "focus": "profitability and market share"},
    {"tickers": "AAPL, MSFT", "focus": "comparative financial analysis"},
    {"tickers": "META, SNAP, PINS", "focus": "social media sector trends"},
    {"tickers": "RIVN", "focus": "financial health and viability"},
    {"tickers": "SNOW", "focus": "revenue growth trajectory"},
    {"tickers": "KO", "focus": "dividend yield and stability"},
    {"tickers": "META", "focus": "latest developments and stock performance"},
    {"tickers": "AAPL, MSFT, GOOGL, AMZN, META", "focus": "big tech comparison and market outlook"},
    {"tickers": "AMC", "focus": "financial analysis and market sentiment"},
]

for query in test_queries:
    crew.kickoff(inputs=query)

Step 2: Export Your Trace Data

With the Python SDK, you can programmatically export your traces. This gives you a lot of control over your data—you can analyze, filter, and evaluate them at scale.
import os
from datetime import datetime, timedelta

from arize import ArizeClient

client = ArizeClient(api_key=os.getenv("ARIZE_API_KEY"))

# Export spans from the last hour (adjust time range as needed)
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

df = client.spans.export_to_df(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    start_time=start_time,
    end_time=end_time,
)

parent_spans = df[df["attributes.openinference.span.kind"] == "CHAIN"]
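Before defining an evaluator, it can help to sanity-check what came back. This quick look at the exported DataFrame uses the same columns referenced later in this guide (context.span_id, attributes.input.value, attributes.output.value):
# Quick sanity check on the exported spans
print(f"Exported {len(df)} spans, including {len(parent_spans)} parent CHAIN spans")
print(parent_spans[["context.span_id", "attributes.input.value", "attributes.output.value"]].head())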

Step 3: Define an Evaluator

Now that we have trace data, the next question is how we decide whether an output is actually good. An evaluation makes that decision explicit. Instead of manually inspecting outputs or relying on intuition, we define a rule that can be applied consistently across many runs. In Arize, evaluations can be written in different ways. In this guide, we’ll use an LLM-as-a-Judge evaluation as a simple starting point. This works well for questions like correctness or relevance, and lets us get metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.) For LLM-as-a-Judge evaluations, that means defining three things:
  • A prompt that describes the judgment criteria
  • An LLM that performs the evaluation
  • The data we want to score
In this step, we’ll define a basic completeness evaluation that checks whether the agent’s output completely answers the input.

Define the Evaluation Prompt

Define the prompt that specifies how the evaluator should judge outputs. The prompt uses attributes.input.value and attributes.output.value to access the input and output data from your spans.
financial_completeness_template = """
You are evaluating whether a financial research report correctly completes ALL parts of the user's task with COMPREHENSIVE coverage.

User input: {attributes.input.value}

Generated report:
{attributes.output.value}

To be marked as "complete", the report MUST meet ALL of these strict requirements:

1. TICKER COVERAGE (MANDATORY):
   - Cover ALL companies/tickers mentioned in the input
   - If multiple tickers are listed, EACH must have dedicated analysis (not just mentioned in passing)
   - For multiple tickers, the report must provide COMPARATIVE analysis when relevant

2. FOCUS AREA COVERAGE (MANDATORY):
   - Address ALL focus areas mentioned in the input
   - If the focus mentions multiple topics (e.g., "earnings and outlook"), BOTH must be thoroughly addressed
   - Each focus area must have substantial content, not just a brief mention

3. FINANCIAL DATA REQUIREMENTS (MANDATORY):
   - For EACH ticker, the report must include:
     * Current/recent stock price or performance data
     * At least 2 key financial ratios (P/E, P/B, debt-to-equity, ROE, etc.)
     * Revenue or earnings information
     * Recent news or developments (within last 6 months)
   - If focus mentions specific metrics (e.g., "P/E ratio"), those MUST be explicitly provided

4. DEPTH REQUIREMENT (MANDATORY):
   - Each ticker must have at least 3-4 sentences of dedicated analysis
   - Generic statements without specific data do NOT count
   - The report must demonstrate thorough research, not superficial coverage

5. COMPARISON REQUIREMENT (if multiple tickers):
   - If 2+ tickers are requested, the report MUST include direct comparisons
   - Comparisons should cover multiple key metrics side-by-side
   - Generic statements like "both companies are good" do NOT satisfy this requirement
   - Must explicitly state which company performs better/worse on specific metrics

The report is "incomplete" if it fails ANY of the above requirements, including:
- Missing any ticker or only mentioning it briefly
- Failing to address any focus area or only addressing it superficially
- Missing required financial data for any ticker
- Providing generic analysis without specific metrics or data
- Failing to provide comparisons when multiple tickers are requested
- Not meeting the depth requirement for any ticker

Respond with ONLY one word: "complete" or "incomplete"
Then provide a detailed explanation of which specific requirements were met or failed.
"""
This prompt defines what completeness means for our application. By making the criteria explicit, we can apply it consistently and understand why outputs pass or fail.
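If you want to see exactly what the judge will read, you can render the template for a single exported span yourself. This is purely illustrative; the evaluator fills these placeholders from the DataFrame columns when it runs, and the simple string replace below just mirrors that substitution for one row:
# Preview the rendered prompt for one parent span (illustration only;
# the evaluator performs this substitution automatically when it runs)
sample = parent_spans.iloc[0]
preview = (
    financial_completeness_template
    .replace("{attributes.input.value}", str(sample["attributes.input.value"]))
    .replace("{attributes.output.value}", str(sample["attributes.output.value"]))
)
print(preview[:500])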

Define the LLM Judge

Choose the LLM model that will perform the evaluation.
from phoenix.evals import LLM

llm = LLM(model="gpt-5", provider="openai")
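The judge calls your model provider’s API, so credentials need to be available in the environment. A minimal check, assuming the OpenAI provider reads the standard OPENAI_API_KEY variable:
import os

# Fail early if the judge's provider credentials aren't configured
# (assumes the standard OPENAI_API_KEY environment variable)
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running the evaluator"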

Create the Evaluator

We’ll use the Phoenix Evals library to run our evaluations by combining the prompt and model into an evaluator. Phoenix Evals provides reusable evaluation primitives (such as create_classifier and LLM) that make it easy to define and run evaluators. We can then use those evaluation results with the Arize AX SDK. While this example shows how to create a custom LLM-as-a-Judge evaluator, there are other ways to create evaluators, including code evaluators for deterministic checks and pre-built LLM evaluator templates for common evaluation scenarios.
from phoenix.evals import create_classifier

completeness_evaluator = create_classifier(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)
At this point, we’ve defined how to evaluate completeness, but we haven’t run it yet.
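If you want a quick sanity check before the full run, you can score a single span now. This small sketch uses the same evaluate_dataframe call covered in the next step, just on a one-row slice:
from phoenix.evals import evaluate_dataframe

# Smoke-test the evaluator on one parent span before scoring everything
smoke_test_df = evaluate_dataframe(
    dataframe=parent_spans.head(1),
    evaluators=[completeness_evaluator],
)
print(smoke_test_df.columns.tolist())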

Step 4: Run the Evaluation

Now, it’s time to evaluate your traces. We’ll apply the evaluator to the exported trace data. We’re using parent_spans because these represent the top-level agent executions that contain the final outputs we want to evaluate. Child spans are individual LLM calls or tool invocations that are part of the overall workflow, but in this case, we want to score the complete agent output. We’ll use evaluate_dataframe to run the evaluation across all our parent spans.
from phoenix.evals import evaluate_dataframe
from openinference.instrumentation import suppress_tracing

# Suppress tracing so the judge's own LLM calls don't show up as new spans in your project
with suppress_tracing():
    results_df = evaluate_dataframe(dataframe=parent_spans, evaluators=[completeness_evaluator])
This produces evaluation results for each span in the dataset. Each result includes a score, label, and explanation.
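The exact names of the added columns depend on your phoenix-evals version, so rather than hard-coding them, a quick way to see what the evaluation produced is to diff the result columns against the input DataFrame:
# Inspect the columns added by the evaluation run (names vary by phoenix-evals version)
added_columns = [col for col in results_df.columns if col not in parent_spans.columns]
print(added_columns)
print(results_df[added_columns].head())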

Step 5: Log Evaluation Results to Arize

We’ll upload the results back to Arize AX where they’ll appear alongside your traces. This connects quality scores to execution data, giving you a complete picture of your application’s performance in one place. If you prefer, you can view the evaluation results programmatically—the evals_df DataFrame contains all evaluation results including score, label, and explanation for each run of the evaluator.
import pandas as pd
from phoenix.evals.utils import to_annotation_dataframe

# Convert Phoenix eval results to annotation format
annotation_df = to_annotation_dataframe(dataframe=results_df)

# Get span_ids from parent_spans
span_ids = parent_spans["context.span_id"].values

# Build evaluation DataFrame
eval_results = []
eval_name = "completeness"

for idx, span_id in enumerate(span_ids):
    if idx < len(annotation_df):
        eval_result = {
            "context.span_id": span_id,
            f"eval.{eval_name}.label": annotation_df["label"].iloc[idx],
            f"eval.{eval_name}.score": float(annotation_df["score"].iloc[idx]),
            f"eval.{eval_name}.explanation": annotation_df["explanation"].iloc[idx],
        }
        eval_results.append(eval_result)

evals_df = pd.DataFrame(eval_results)

# Upload evaluations to Arize AX
response = client.spans.update_evaluations(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="arize-sdk-quickstart",
    dataframe=evals_df,
    force_http=True,
)
Here is an example of what evals_df looks like:

| context.span_id  | eval.completeness.label | eval.completeness.score | eval.completeness.explanation                      |
|------------------|-------------------------|-------------------------|----------------------------------------------------|
| 0b23102b9aafe16e | complete                | 1.0                     | The report provided comprehensive coverage for...  |
| 9f7634e3055757ef | incomplete              | 0.0                     | **Evaluation of Requirements:**\n\n1. **Ticker...  |
| bce86b9aad1b21a1 | complete                | 1.0                     | The report comprehensively covers NVIDIA NVDA...   |
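Because the scores are numeric (1.0 for complete, 0.0 for incomplete), evals_df also makes it easy to compute a quick aggregate before or after uploading; a small sketch using the column names built above:
# Summarize completeness across all evaluated parent spans
completeness_rate = evals_df["eval.completeness.score"].mean()
incomplete_count = (evals_df["eval.completeness.label"] == "incomplete").sum()
print(f"Completeness rate: {completeness_rate:.0%} ({incomplete_count} incomplete runs)")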
After the upload completes, check the Arize AX dashboard to see eval results:
Evaluation results in Arize AX dashboard
Congratulations! You’ve run your first evaluation with the Python SDK.

Learn More About Evaluations

With evaluation results logged to Arize AX, you’re ready to learn the workflows for continuous evaluation and improvement. To go deeper on evaluations, the Evaluator Docs cover writing more nuanced evaluators, using different scoring strategies, and comparing quality across runs as your application evolves.