Why Public Benchmarks Lie: Building Your Own Eval Harness

When a new model tops a public leaderboard, it’s tempting to assume it’s the right choice for your application. But public benchmarks measure generic capabilities on generic data with a generic metric — and your task is narrow and specific. The only benchmark that predicts how a model performs on your task is a benchmark built from your task. In this tutorial you’ll build that harness for an email text-extraction service and use it to compare two models fairly. You will:

Build a small domain dataset of emails + their correct extractions — your benchmark, not a public one
Define an extraction task with a fixed schema and prompt, parameterized only by the model
Define two evaluators — string similarity and field-level accuracy — and see how they can rank the models differently
Run the same harness across gpt-5.4-mini and the flagship gpt-5.5 and compare them fairly in Arize AX

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook.

Building your own eval harness

Experiments in Arize AX

An Arize AX experiment is made of three elements: a dataset (the inputs and expected outputs), a task (run once per example), and one or more evaluators (score the task’s output). Hold the dataset and evaluators constant, swap only the model, and the comparison is fair by construction.
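
To make the three-part structure concrete, here is a minimal local sketch of the same idea in plain Python. It is not the Arize AX API: run_local_experiment, the two stub "models", the toy dataset, and the category_match evaluator are all hypothetical names invented for illustration. It only shows that when the dataset and evaluators are fixed, the task is the single thing that varies between runs.

```python
from statistics import mean
from typing import Callable

Row = dict
Task = Callable[[Row], dict]
Evaluator = Callable[[dict, Row], float]


def run_local_experiment(
    dataset: list[Row], task: Task, evaluators: dict[str, Evaluator]
) -> dict[str, float]:
    """Run the task once per example and average each evaluator's score."""
    scores = {name: [] for name in evaluators}
    for row in dataset:
        output = task(row)
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(output, row))
    return {name: mean(values) for name, values in scores.items()}


# Hypothetical toy data and stub "models" so the sketch runs offline.
DATASET = [
    {"email": "Invoice attached, due Friday.", "category": "invoice"},
    {"email": "Can we meet Tuesday?", "category": "meeting"},
]


def stub_model_a(row: Row) -> dict:
    # Always answers "invoice", so it is right on one of the two examples.
    return {"category": "invoice"}


def stub_model_b(row: Row) -> dict:
    # Keyword rule standing in for a stronger model.
    return {"category": "meeting" if "meet" in row["email"].lower() else "invoice"}


def category_match(output: dict, row: Row) -> float:
    return 1.0 if output["category"] == row["category"] else 0.0


EVALS = {"category_match": category_match}
report = {
    "stub_model_a": run_local_experiment(DATASET, stub_model_a, EVALS),
    "stub_model_b": run_local_experiment(DATASET, stub_model_b, EVALS),
}
print(report)
```

Both calls share DATASET and EVALS, so any difference in the report comes from the task alone.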

Build a domain dataset

This is the part public benchmarks can’t do for you. We hand-label a handful of emails the way our service actually sees them, each paired with the exact structured output we want back: sender, category, summary, action_required, and due_date. Arize AX dataset examples are flat dicts, so each row is the email plus its expected fields.

import os
from datetime import datetime, timezone

import pandas as pd
from arize.client import ArizeClient

ax_client = ArizeClient(api_key=os.environ["ARIZE_API_KEY"])
SPACE_ID = os.environ["ARIZE_SPACE_ID"]

# rows = [{"email": ..., "sender": ..., "category": ..., "summary": ...,
#          "action_required": ..., "due_date": ...}, ...]
df = pd.DataFrame(rows)
OUTPUT_KEYS = ["sender", "category", "summary", "action_required", "due_date"]

DATASET_NAME = f"email-extraction-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"
dataset = ax_client.datasets.create(name=DATASET_NAME, space=SPACE_ID, examples=df)
print(f"Uploaded {len(df)} examples to dataset '{DATASET_NAME}' (id: {dataset.id})")

The printed dataset name and id let you find it quickly in the Datasets tab in Arize AX.
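
Hand-labeled rows are easy to get subtly wrong, and a mislabeled expected value silently penalizes every model. One option is to check the rows before uploading them. The sketch below is a hypothetical helper, not part of the tutorial notebook or of Arize AX. It assumes the field names and category values used in this tutorial's schema, and the two example rows are invented for illustration.

```python
import re
from datetime import date

# Field names and categories copied from this tutorial's schema.
REQUIRED_KEYS = {"email", "sender", "category", "summary", "action_required", "due_date"}
CATEGORIES = {"meeting", "invoice", "support_request", "sales", "internal_update"}
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def _is_iso_date(value) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one labeled row (empty means OK)."""
    problems = []
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "category" in row and row["category"] not in CATEGORIES:
        problems.append(f"unknown category: {row['category']!r}")
    if "action_required" in row and not isinstance(row["action_required"], bool):
        problems.append("action_required must be a bool")
    if "due_date" in row and row["due_date"] != "none" and not _is_iso_date(row["due_date"]):
        problems.append(f"due_date is not YYYY-MM-DD or 'none': {row['due_date']!r}")
    return problems


# Hypothetical rows, for illustration only.
good_row = {
    "email": "Hi, invoice 1042 is attached. Please pay by March 31.",
    "sender": "billing@acme.example",
    "category": "invoice",
    "summary": "Invoice 1042 is due at the end of March.",
    "action_required": True,
    "due_date": "2025-03-31",
}
bad_row = {k: v for k, v in good_row.items() if k != "sender"}
bad_row.update(category="billing", due_date="31/03/2025")

print(validate_row(good_row))
print(validate_row(bad_row))
```

Running validate_row over every row before calling datasets.create keeps labeling mistakes out of the benchmark itself.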

Define the extraction task

The task is held constant across runs: the same schema, the same prompt, the same parsing. Only the model changes. We use OpenAI’s structured outputs so every model returns the exact same shape. The experiment passes each example’s row as dataset_row; we read the email out of it.

from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class EmailExtraction(BaseModel):
    sender: str
    category: Literal["meeting", "invoice", "support_request", "sales", "internal_update"]
    summary: str
    action_required: bool
    due_date: str  # ISO date (YYYY-MM-DD) or the literal string "none"


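# PROMPT is the fixed prompt template with an {email} placeholder; it is not shown in this snippet.
def make_task(model: str):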
    def task(dataset_row) -> dict:
        response = client.beta.chat.completions.parse(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(email=dataset_row["email"])}],
            response_format=EmailExtraction,
        )
        return response.choices[0].message.parsed.model_dump()

    return task

Choose metrics that measure what you care about

This is where benchmarks quietly lie. We score the same outputs two ways: jaro_winkler (forgiving string similarity, computed here over the whole JSON-serialized output and best suited to free text such as the summary) and field_accuracy (exact match, after trimming and lowercasing, on the operational fields downstream code depends on). The two metrics measure different things, so they can rank the models differently. When they disagree, the metric that should decide is the one tied to your downstream needs. Each evaluator receives the task’s output and the example’s dataset_row, and returns an EvaluationResult (Arize AX requires a score plus a non-null label and explanation).

import json

import jarowinkler
from arize.experiments import EvaluationResult

OPERATIONAL_FIELDS = ["sender", "category", "action_required", "due_date"]


def _expected(dataset_row) -> dict:
    return {k: dataset_row[k] for k in OUTPUT_KEYS}


def jaro_winkler(output, dataset_row) -> EvaluationResult:
    score = jarowinkler.jarowinkler_similarity(
        json.dumps(output, sort_keys=True),
        json.dumps(_expected(dataset_row), sort_keys=True),
    )
    return EvaluationResult(
        score=score, label=f"{score:.2f}", explanation="JSON string similarity vs expected"
    )


def field_accuracy(output, dataset_row) -> EvaluationResult:
    expected = _expected(dataset_row)
    matches = sum(
        1
        for k in OPERATIONAL_FIELDS
        if str(output.get(k)).strip().lower() == str(expected[k]).strip().lower()
    )
    score = matches / len(OPERATIONAL_FIELDS)
    return EvaluationResult(
        score=score,
        label=f"{matches}/{len(OPERATIONAL_FIELDS)}",
        explanation=f"{matches} of {len(OPERATIONAL_FIELDS)} operational fields matched exactly",
    )


EVALUATORS = [jaro_winkler, field_accuracy]

The disagreement is easiest to see deterministically. Take two candidate extractions for the same invoice email: A has every operational field right but a fully reworded summary; B has an identical summary but its due_date is off by a day. field_accuracy prefers A (all operational fields correct), while jaro_winkler prefers B (it looks almost identical) — yet B’s one-day date slip is exactly what breaks downstream code. Same outputs, opposite rankings, and the strict metric is the right one to trust.
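
The A-versus-B comparison can be reproduced offline with a few lines of standard-library Python. This is a sketch under stated assumptions: the expected extraction and both candidates are invented values, and difflib.SequenceMatcher is swapped in as a forgiving string-similarity stand-in so the example needs no third-party package. It is not Jaro-Winkler, and exact scores will differ from the jarowinkler library, although the ordering is expected to be the same for inputs like these.

```python
import json
from difflib import SequenceMatcher

OPERATIONAL_FIELDS = ["sender", "category", "action_required", "due_date"]

# Hypothetical expected extraction for an invoice email (illustrative values only).
EXPECTED = {
    "sender": "billing@acme.example",
    "category": "invoice",
    "summary": "Invoice 1042 for March hosting is due at the end of the month.",
    "action_required": True,
    "due_date": "2025-03-31",
}
# A: operational fields all correct, summary fully reworded.
CANDIDATE_A = {**EXPECTED, "summary": "Acme wants the March hosting bill paid before month end."}
# B: summary identical, due_date off by one day.
CANDIDATE_B = {**EXPECTED, "due_date": "2025-03-30"}


def string_similarity(output: dict, expected: dict) -> float:
    """Forgiving similarity over sorted-key JSON (difflib stand-in, not Jaro-Winkler)."""
    a = json.dumps(output, sort_keys=True)
    b = json.dumps(expected, sort_keys=True)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def field_accuracy(output: dict, expected: dict) -> float:
    """Fraction of operational fields that match after trimming and lowercasing."""
    matches = sum(
        str(output.get(k)).strip().lower() == str(expected[k]).strip().lower()
        for k in OPERATIONAL_FIELDS
    )
    return matches / len(OPERATIONAL_FIELDS)


scores = {
    name: {
        "string_similarity": string_similarity(candidate, EXPECTED),
        "field_accuracy": field_accuracy(candidate, EXPECTED),
    }
    for name, candidate in {"A": CANDIDATE_A, "B": CANDIDATE_B}.items()
}
for name, row in scores.items():
    print(name, {metric: round(value, 3) for metric, value in row.items()})
```

B differs from the expected JSON by a single character, so any character-level similarity scores it close to 1.0, while field_accuracy drops it to 0.75 because due_date is one of only four operational fields.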

Run the same harness across models

Same dataset, same evaluators, same prompt — we change only the model argument. Each run(...) uploads its results to Arize AX and returns a results dataframe.

experiment_mini, results_mini = ax_client.experiments.run(
    name=f"gpt-5.4-mini-{DATASET_NAME}",
    dataset=DATASET_NAME,
    space=SPACE_ID,
    task=make_task("gpt-5.4-mini"),
    evaluators=EVALUATORS,
)

experiment_full, results_full = ax_client.experiments.run(
    name=f"gpt-5.5-{DATASET_NAME}",
    dataset=DATASET_NAME,
    space=SPACE_ID,
    task=make_task("gpt-5.5"),
    evaluators=EVALUATORS,
)

Compare fairly

Each run returns a results dataframe with an eval.<name>.score column per evaluator. Roll those up per metric, then open the dataset’s Experiments tab in Arize AX to compare the two runs example-by-example.

def average_scores(results) -> dict:
    return {
        "jaro_winkler": results["eval.jaro_winkler.score"].mean(),
        "field_accuracy": results["eval.field_accuracy.score"].mean(),
    }


summary = pd.DataFrame(
    {
        "gpt-5.4-mini": average_scores(results_mini),
        "gpt-5.5": average_scores(results_full),
    }
)
print(summary)
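
Averages say which model scored higher, not which fields it got wrong. A small helper can list the mismatched operational fields for a single example. The sketch below is hypothetical and not part of the notebook: mismatched_fields and the example values are invented for illustration, and the normalization is assumed to mirror the one used by field_accuracy.

```python
OPERATIONAL_FIELDS = ["sender", "category", "action_required", "due_date"]


def _norm(value) -> str:
    # Same normalization as field_accuracy: stringify, trim, lowercase.
    return str(value).strip().lower()


def mismatched_fields(output: dict, expected: dict) -> dict:
    """Map each operational field that differs to a (got, expected) pair."""
    return {
        k: (output.get(k), expected[k])
        for k in OPERATIONAL_FIELDS
        if _norm(output.get(k)) != _norm(expected[k])
    }


# Hypothetical example: the sender is over-captured and the due_date is dropped.
expected = {
    "sender": "Dana Lee",
    "category": "invoice",
    "action_required": True,
    "due_date": "2025-03-31",
}
output = {
    "sender": "Dana Lee <dana@acme.example>",
    "category": "Invoice",
    "action_required": True,
    "due_date": "none",
}
diff = mismatched_fields(output, expected)
print(diff)
```

The category difference in casing is ignored because both sides are lowercased before comparison, so only sender and due_date are reported.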

View Results

Open the dataset’s Experiments tab to see the runs side by side. Each experiment is a row, and each evaluator becomes its own score column (so field_accuracy sits next to jaro_winkler), alongside operational columns (average latency, cost, and error rate) that matter for a real model choice but never show up on a public leaderboard. Sort by the metric tied to your downstream needs (field_accuracy) rather than the forgiving one, then click any row to drop into the example-level view: the input email, the model’s extraction, and each evaluator’s score for that single example. That’s where you see why one model wins (a due_date the model dropped, a sender it over-captured) instead of trusting an aggregate.

A public benchmark tells you how a model does on someone else’s task, with someone else’s metric. An Arize AX experiment built from your own data, with a metric tied to your downstream needs, gives you a number you can actually defend. Comparing models, prompts, or providers then comes down to swapping one argument and reading the results.
