- Build a small domain dataset of emails + their correct extractions — your benchmark, not a public one
- Define an extraction task with a fixed schema and prompt, parameterized only by the model
- Define two evaluators — string similarity and field-level accuracy — and see how they can rank the models differently
- Run the same harness across gpt-5.4-mini and the flagship gpt-5.5 and compare them fairly in Arize AX
Notebook Walkthrough
We will go through key code snippets on this page. To follow the full tutorial, check out the notebook.
Building your own eval harness
Experiments in Arize AX
An Arize AX experiment is made of three elements: a dataset (the inputs and expected outputs), a task (run once per example), and one or more evaluators (score the task’s output). Hold the dataset and evaluators constant, swap only the model, and the comparison is fair by construction.
Build a domain dataset
This is the part public benchmarks can’t do for you. We hand-label a handful of emails the way our service actually sees them, each paired with the exact structured output we want back: sender, category, summary, action_required, and due_date. Arize AX dataset examples are flat dicts, so each row is the email plus its expected fields.
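As a rough illustration of what "flat dicts" means here, a labeled row might look like the sketch below. The two emails, their labels, the empty-string encoding for a missing due date, and the validate_row helper are all invented for illustration rather than taken from the notebook, and the upload to Arize AX is left out.

```python
# Hypothetical hand-labeled examples; the emails and labels are made up.
EXPECTED_FIELDS = ("sender", "category", "summary", "action_required", "due_date")

examples = [
    {
        "email": (
            "From: dana@acme.example\n"
            "Subject: Invoice overdue\n\n"
            "Please pay invoice INV-77 by 2025-03-14."
        ),
        "sender": "dana@acme.example",
        "category": "billing",
        "summary": "Invoice INV-77 is overdue and must be paid.",
        "action_required": True,
        "due_date": "2025-03-14",
    },
    {
        "email": (
            "From: news@widgets.example\n"
            "Subject: March newsletter\n\n"
            "Here is what shipped this month."
        ),
        "sender": "news@widgets.example",
        "category": "newsletter",
        "summary": "Monthly product newsletter.",
        "action_required": False,
        # Assumption: an empty string encodes "no deadline".
        "due_date": "",
    },
]


def validate_row(row):
    """Return True if the row is flat and has the email plus every expected field."""
    missing = [key for key in ("email", *EXPECTED_FIELDS) if key not in row]
    nested = [key for key, value in row.items() if isinstance(value, (dict, list))]
    return not missing and not nested
```

Checking every row with a helper like this before uploading catches a forgotten label early, before it shows up as a misleading evaluator score.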
Define the extraction task
The task is what we hold almost constant: the same schema, the same prompt, the same parsing; only the model changes. We use OpenAI’s structured outputs so every model returns the exact same shape. The experiment passes each example’s row as dataset_row; we read the email out of it.
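One way to keep everything but the model fixed is a small factory that binds the model name and returns the task. The sketch below is an assumption about shape, not the notebook's code: call_model is a hypothetical hook standing in for the structured-outputs request, and fake_call_model is an offline stub so the example runs without an API key.

```python
import json
from typing import Callable, Dict

# Fixed prompt and schema: identical for every model under test.
SYSTEM_PROMPT = (
    "Extract sender, category, summary, action_required, and due_date "
    "from the email. Reply with JSON only."
)
SCHEMA_FIELDS = ("sender", "category", "summary", "action_required", "due_date")


def make_task(model: str, call_model: Callable[[str, str, str], str]):
    """Return a task bound to one model; prompt, schema, and parsing stay fixed.

    `call_model(model, system_prompt, email)` is a hypothetical hook that
    returns the model's JSON reply as a string.
    """

    def task(dataset_row: Dict) -> Dict:
        raw = call_model(model, SYSTEM_PROMPT, dataset_row["email"])
        parsed = json.loads(raw)
        # Keep only schema fields so every model yields the same shape.
        return {field: parsed.get(field) for field in SCHEMA_FIELDS}

    return task


def fake_call_model(model: str, system_prompt: str, email: str) -> str:
    """Offline stub: returns a canned reply instead of calling an API."""
    return json.dumps(
        {
            "sender": "dana@acme.example",
            "category": "billing",
            "summary": "Invoice INV-77 is overdue.",
            "action_required": True,
            "due_date": "2025-03-14",
            "confidence": 0.9,  # extra key, dropped by the task
        }
    )


task = make_task("gpt-5.4-mini", fake_call_model)
output = task({"email": "Please pay invoice INV-77 by 2025-03-14."})
```

Because the model name is the only argument that varies, any difference between two runs can be attributed to the model rather than to a drifting prompt or parser.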
Choose metrics that measure what you care about
This is where benchmarks quietly lie. We score the same outputs two ways: jaro_winkler (forgiving string similarity, the right tool for the free-text summary) and field_accuracy (exact match on the operational fields downstream code depends on). The two metrics measure different things, so they can rank the models differently, and when they disagree, the metric that should decide is the one tied to your downstream needs.
Each evaluator receives the task’s output and the example’s dataset_row, and returns an EvaluationResult (Arize AX requires a score plus a non-null label and explanation).
Consider two outputs for the same email: A gets every operational field right but rewords the summary; B has an identical summary but its due_date is off by a day. field_accuracy prefers A (all operational fields correct), while jaro_winkler prefers B (it looks almost identical), yet B’s one-day date slip is exactly what breaks downstream code. Same outputs, opposite rankings, and the strict metric is the right one to trust.
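The A versus B disagreement can be reproduced with a small self-contained sketch. Several things here are assumptions rather than the notebook's code: the EvaluationResult dataclass is a local stand-in for the Arize AX type, the Jaro-Winkler function is a plain textbook implementation, the string metric is applied to a JSON serialization of the whole output, and the example values are invented.

```python
import json
from dataclasses import dataclass


@dataclass
class EvaluationResult:
    """Local stand-in for the Arize AX result type (score, label, explanation)."""
    score: float
    label: str
    explanation: str


OPERATIONAL_FIELDS = ("sender", "category", "action_required", "due_date")


def jaro_winkler_similarity(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity plus the Winkler bonus for a shared prefix (max 4 chars)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half-transpositions.
    k = out_of_order = 0
    for i, ch in enumerate(s1):
        if used1[i]:
            while not used2[k]:
                k += 1
            out_of_order += ch != s2[k]
            k += 1
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - out_of_order / 2) / matches) / 3
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


def field_accuracy(output: dict, dataset_row: dict) -> EvaluationResult:
    """Fraction of operational fields that match the label exactly."""
    hits = sum(output.get(f) == dataset_row.get(f) for f in OPERATIONAL_FIELDS)
    score = hits / len(OPERATIONAL_FIELDS)
    label = "pass" if score == 1.0 else "fail"
    return EvaluationResult(score, label, f"{hits}/{len(OPERATIONAL_FIELDS)} fields match")


def jaro_winkler(output: dict, dataset_row: dict) -> EvaluationResult:
    """String similarity over the serialized output (an assumption of this sketch)."""
    fields = OPERATIONAL_FIELDS + ("summary",)
    actual = json.dumps({f: output.get(f) for f in fields}, sort_keys=True)
    expected = json.dumps({f: dataset_row.get(f) for f in fields}, sort_keys=True)
    score = jaro_winkler_similarity(actual, expected)
    label = "similar" if score >= 0.9 else "different"
    return EvaluationResult(score, label, "Jaro-Winkler over serialized fields")


expected = {
    "sender": "dana@acme.example", "category": "billing",
    "summary": "Invoice INV-77 is overdue and must be paid by the deadline.",
    "action_required": True, "due_date": "2025-03-14",
}
output_a = {**expected, "summary": "Payment reminder."}  # reworded summary
output_b = {**expected, "due_date": "2025-03-15"}        # date off by one day

for name, out in (("A", output_a), ("B", output_b)):
    fa, jw = field_accuracy(out, expected), jaro_winkler(out, expected)
    print(name, f"field_accuracy={fa.score:.2f}", f"jaro_winkler={jw.score:.3f}")
```

In this sketch field_accuracy scores A at 1.0 and B at 0.75, while the string metric puts B ahead because a one-character date change barely moves a character-level similarity.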
Run the same harness across models
Same dataset, same evaluators, same prompt; we change only the model argument. Each run(...) uploads its results to Arize AX and returns a results dataframe.
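To make the "change only the model" loop concrete without guessing at the Arize AX client's signature, here is a purely local stand-in. run_locally, make_task, and category_match are hypothetical names for this sketch, nothing is uploaded, and the "model" is a toy keyword rule that ignores the model name, so the identical scores it produces say nothing about the real models.

```python
MODELS = ["gpt-5.4-mini", "gpt-5.5"]

# Tiny invented dataset: one input column plus one expected field.
dataset = [
    {"email": "Please pay invoice INV-77.", "category": "billing"},
    {"email": "Our weekly newsletter is out.", "category": "newsletter"},
]


def make_task(model):
    """Bind the model name; everything else about the task is shared."""

    def task(dataset_row):
        # Toy keyword rule standing in for the real model call.
        text = dataset_row["email"].lower()
        category = "billing" if "invoice" in text else "newsletter"
        return {"model": model, "category": category}

    return task


def category_match(output, dataset_row):
    """Exact-match evaluator on a single field."""
    return 1.0 if output["category"] == dataset_row["category"] else 0.0


EVALUATORS = {"category_match": category_match}


def run_locally(name, dataset, task, evaluators):
    """Hypothetical local harness: one result row per example."""
    rows = []
    for example in dataset:
        output = task(example)
        row = {"experiment": name, "output": output}
        for eval_name, evaluator in evaluators.items():
            # Mirror the eval.<name>.score column naming used in the results.
            row[f"eval.{eval_name}.score"] = evaluator(output, example)
        rows.append(row)
    return rows


# Same dataset, same evaluators; only the model argument changes.
results = {
    model: run_locally(f"extract-{model}", dataset, make_task(model), EVALUATORS)
    for model in MODELS
}
```

The point of the structure is that the dataset and evaluator objects are created once and passed unchanged into every run, so nothing but the model can differ between experiments.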
Compare fairly
Each run returns a results dataframe with an eval.<name>.score column per evaluator. Roll those up per metric, then open the dataset’s Experiments tab in Arize AX to compare the two runs example-by-example.
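A per-metric roll-up can be as simple as averaging each score column. The sketch below works on a list of row dicts (for a pandas dataframe, something like df.to_dict("records") would produce that shape); the rollup helper, the run names, and the scores are placeholders, not measured results.

```python
from statistics import mean

SCORE_PREFIX, SCORE_SUFFIX = "eval.", ".score"


def rollup(rows):
    """Average every eval.<name>.score column found in one run's rows."""
    columns = sorted(
        {
            key
            for row in rows
            for key in row
            if key.startswith(SCORE_PREFIX) and key.endswith(SCORE_SUFFIX)
        }
    )
    return {
        # Strip the prefix and suffix to recover the evaluator name.
        col[len(SCORE_PREFIX):-len(SCORE_SUFFIX)]: mean(
            row[col] for row in rows if col in row
        )
        for col in columns
    }


# Placeholder rows for two hypothetical runs; the scores are illustrative only.
runs = {
    "run_a": [
        {"eval.field_accuracy.score": 1.0, "eval.jaro_winkler.score": 0.75},
        {"eval.field_accuracy.score": 0.5, "eval.jaro_winkler.score": 0.25},
    ],
    "run_b": [
        {"eval.field_accuracy.score": 0.5, "eval.jaro_winkler.score": 1.0},
        {"eval.field_accuracy.score": 0.5, "eval.jaro_winkler.score": 0.5},
    ],
}

summary = {name: rollup(rows) for name, rows in runs.items()}
for name, metrics in summary.items():
    print(name, metrics)
```

Averages hide which examples failed, so the roll-up is a starting point for the example-by-example comparison rather than a replacement for it.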
View Results
Open the dataset’s Experiments tab to see the runs side by side. Each experiment is a row; each evaluator becomes its own score column (so field_accuracy sits next to jaro_winkler), alongside operational columns (average latency, cost, and error rate) that matter for a real model choice but never show up on a public leaderboard.
Sort by the metric tied to your downstream needs (field_accuracy) rather than the forgiving one, then click any row to drop into the example-level view: the input email, the model’s extraction, and each evaluator’s score for that single example. That’s where you see why one model wins — a due_date the model dropped, a sender it over-captured — instead of trusting an aggregate.
A public benchmark tells you how a model does on someone else’s task, with someone else’s metric. An Arize AX experiment built from your own data, with a metric tied to your downstream needs, gives you a number you can actually defend — and comparing models, prompts, or providers becomes just swapping one argument and reading it.