Creating a Custom LLM Evaluator with a Benchmark Dataset

In this tutorial, you’ll learn how to build a custom LLM-as-a-Judge evaluator tailored to your specific use case. Arize AX provides several pre-built evaluators that have been tested against benchmark datasets, but these may not cover the nuances of your application. So how can you achieve the same level of rigor when your use case falls outside the scope of the standard evaluators? The discipline is simple: don’t trust a judge until you’ve measured it against ground truth. We’ll create a benchmark dataset from a small set of human-annotated examples, then use it to build and refine a custom evaluator, measuring how often the judge agrees with the human labels and iterating until that agreement is high enough to trust. The diagram below provides an overview of the process we will follow in this walkthrough.

Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook.

Building a custom evaluator

Generate Image Classification Traces

In this tutorial, we’ll ask an LLM to generate expense reports from receipt images provided as public URLs. Running the cells below will generate traces, which you can explore directly in Arize AX for annotation. We’ll use gpt-5.4-mini, which supports image inputs.

Dataset Information: Jakob (2024). Receipt or Invoice Dataset. Roboflow Universe. CC BY 4.0. Available at: Roboflow Universe (accessed on 2025‑07‑29)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL = "gpt-5.4-mini"
JUDGE_MODEL = "gpt-4.1"


def extract_receipt_data(image_url):
    """Ask the model for a short expense-report summary of one receipt image."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Analyze this receipt and return a brief summary for an expense report. Only include category of expense, total cost, and summary of items"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                        },
                    },
                ],
            }
        ],
    )
    return response

By following the auto-instrumentation setup outlined in the notebook, running the cell below will automatically send traces to Arize AX.

# `urls` is the list of public receipt-image URLs defined in the notebook.
for url in urls:
    extract_receipt_data(url)

Create Benchmark Dataset

After generating traces, open Arize AX to begin annotating them. In this example, we’ll annotate based on “accuracy”, but you can choose any evaluation criterion that fits your use case. If you do, update the annotation column in the filter below (annotation.accuracy.label here) to match the annotation key you’re using, so that the annotated examples are included in your benchmark dataset.

Export your traces with the Arize AX client, then keep only the rows you annotated:

import os
from datetime import datetime, timedelta, timezone

from arize.client import ArizeClient

ax_client = ArizeClient(api_key=os.environ["API_KEY"])

primary_df = ax_client.spans.export_to_df(
    space_id=os.environ["SPACE_ID"],
    project_name="receipt-classifications",
    start_time=datetime.now(timezone.utc) - timedelta(days=50),
    end_time=datetime.now(timezone.utc),
)

import json

filtered_df = primary_df[
    (primary_df["annotation.accuracy.label"].notna())
][[
    "attributes.input.value",
    "attributes.output.value",
    "annotation.accuracy.label",
]].rename(columns={
    "attributes.input.value": "image",
    "attributes.output.value": "response",
    "annotation.accuracy.label": "accuracy"
})


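def extract_url(input_value):
    """Pull the receipt image URL out of the serialized request."""
    data = json.loads(input_value)
    # The image is the second content part of the first (user) message.
    return data["messages"][0]["content"][1]["image_url"]["url"]

def extract_content(output_value):
    """Pull the model's text reply out of the serialized response."""
    data = json.loads(output_value)
    return data["choices"][0]["message"]["content"]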

filtered_df["image"] = filtered_df["image"].apply(extract_url)
filtered_df["response"] = filtered_df["response"].apply(extract_content)

filtered_df
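
The two helpers above assume that the span's input and output columns hold the JSON-serialized chat request and response. As a rough sketch of what they extract, here they are applied to hand-written payloads. The sample URL and report text are made up for illustration, and real span payloads may carry additional fields.

```python
import json


def extract_url(input_value):
    """Pull the receipt image URL out of a serialized chat request."""
    data = json.loads(input_value)
    return data["messages"][0]["content"][1]["image_url"]["url"]


def extract_content(output_value):
    """Pull the model's text reply out of a serialized chat response."""
    data = json.loads(output_value)
    return data["choices"][0]["message"]["content"]


# Hypothetical payloads shaped like the request and response in this tutorial.
sample_input = json.dumps({
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this receipt..."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }]
})
sample_output = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Category: Meals. Total: 12.50."}}]
})

sample_url = extract_url(sample_input)
sample_report = extract_content(sample_output)
print(sample_url)
print(sample_report)
```

If a span was recorded with a different message layout (for example, the image part first), the fixed indices in extract_url will raise or return the wrong part, so it is worth spot-checking a few rows before building the dataset.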

Create the dataset from the annotated rows. Arize AX dataset examples are flat dicts, so each row’s image, response, and accuracy columns become fields the task and evaluator read by name:

from datetime import datetime, timezone

DATASET_NAME = f"annotated-receipts-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"
dataset = ax_client.datasets.create(
    space=os.environ["SPACE_ID"],
    name=DATASET_NAME,
    examples=filtered_df,
)

Build the custom judge & run an experiment

The judge is an LLM-as-a-Judge that reads the receipt image and the model’s expense report, then classifies the report as accurate, almost accurate, or inaccurate: the same labels the human annotator used. make_judge(prompt) binds one judge prompt into an experiment task. The experiment’s evaluator (matches_annotation) then checks whether the judge’s label matches the human annotation, so the experiment score is the judge’s agreement rate with ground truth.

RAILS = ["accurate", "almost accurate", "inaccurate"]

JUDGE_PROMPT_V1 = """You are an evaluator tasked with assessing the quality of a model-generated expense report based on a receipt.

MODEL OUTPUT (Expense Report):
{output}

The input receipt image is attached. Evaluate the report and assign exactly one label:
- "accurate" - Fully correct
- "almost accurate" - Mostly correct
- "inaccurate" - Substantially wrong

Respond with only the label."""


def make_judge(prompt):
    def task_function(dataset_row):
        response = client.chat.completions.create(
            model=JUDGE_MODEL,
            temperature=0,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt.format(output=dataset_row["response"])},
                        {"type": "image_url", "image_url": {"url": dataset_row["image"]}},
                    ],
                }
            ],
        )
        verdict = response.choices[0].message.content.strip().lower()
        # "accurate" is a substring of "almost accurate", so check the longer labels first.
        for rail in ["almost accurate", "inaccurate", "accurate"]:
            if rail in verdict:
                return rail
        return verdict

    return task_function
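
The ordering comment inside make_judge matters more than it may look: "accurate" is a substring of both other labels. The sketch below isolates the label-parsing step so the difference is visible without calling a model. parse_label and naive_parse are illustrative names, not part of the notebook, and returning None for an unrecognized reply is a choice made here (make_judge returns the raw reply instead).

```python
RAILS = ["accurate", "almost accurate", "inaccurate"]


def parse_label(verdict, rails=RAILS):
    """Map a free-text judge reply to one rail, or None if nothing matches."""
    text = verdict.strip().lower()
    # Check longer labels first so "almost accurate" and "inaccurate"
    # are not swallowed by the shorter "accurate".
    for rail in sorted(rails, key=len, reverse=True):
        if rail in text:
            return rail
    return None


def naive_parse(verdict, rails=RAILS):
    """Same scan, but in declaration order: shown only to expose the bug."""
    text = verdict.strip().lower()
    for rail in rails:
        if rail in text:
            return rail
    return None


for reply in ["Almost accurate.", "inaccurate", "Accurate", "I cannot tell"]:
    print(repr(reply), "->", parse_label(reply), "| naive:", naive_parse(reply))
```

Substring matching is still a heuristic: a reply such as "not accurate" would be read as "accurate". Asking the judge to respond with only the label, as the prompt does, keeps that case rare.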

from arize.experiments import EvaluationResult


def matches_annotation(output, dataset_row) -> EvaluationResult:
    """Score the judge: does its label match the human annotation for this example?"""
    expected = dataset_row["accuracy"]
    if output == expected:
        return EvaluationResult(
            score=1.0, label="correct", explanation="Judge label matches the human annotation"
        )
    return EvaluationResult(
        score=0.0,
        label="incorrect",
        explanation=f"Judge said '{output}', annotation was '{expected}'",
    )

experiment_v1, results_v1 = ax_client.experiments.run(
    space=os.environ["SPACE_ID"],
    dataset=DATASET_NAME,
    task=make_judge(JUDGE_PROMPT_V1),
    evaluators=[matches_annotation],
    name="Initial Experiment",
)
results_v1["eval.matches_annotation.score"].mean()  # agreement with human labels
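
The mean above is a single number. When agreement is low, it helps to see which labels the judge confuses. The following is a minimal, library-free sketch of that breakdown on made-up labels; agreement_report and normalize are hypothetical helpers, and the normalization step is an assumption added here in case human annotations differ from the judge's output in case or spacing.

```python
from collections import Counter


def normalize(label):
    """Lowercase, trim, and collapse whitespace so 'Almost  Accurate' compares equal."""
    return " ".join(str(label).strip().lower().split())


def agreement_report(judge_labels, human_labels):
    """Return (agreement rate, Counter of (human, judge) label pairs)."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one example")
    pairs = [(normalize(h), normalize(j)) for h, j in zip(human_labels, judge_labels)]
    confusion = Counter(pairs)
    matches = sum(n for (h, j), n in confusion.items() if h == j)
    return matches / len(pairs), confusion


# Made-up labels for illustration only.
human = ["accurate", "almost accurate", "inaccurate", "accurate", "Almost accurate"]
judge = ["accurate", "accurate", "inaccurate", "accurate", "almost accurate"]

rate, confusion = agreement_report(judge, human)
print(f"agreement: {rate:.2f}")
for (h, j), n in sorted(confusion.items()):
    print(f"human={h!r:20} judge={j!r:20} count={n}")
```

Off-diagonal pairs (human and judge labels differ) point at the label definitions worth tightening in the next prompt revision.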

You will see your experiment result in the experiments tab of your dataset:

Iterate on the judge prompt

Next, we’ll refine the judge prompt by adding more specific classification rules, based on gaps we saw in the previous iteration. We keep the dataset and evaluator constant and change only the prompt, then rerun, so any change in agreement is attributable to the prompt:

JUDGE_PROMPT_V2 = """You are an evaluator tasked with assessing the quality of a model-generated expense report based on a receipt.

MODEL OUTPUT (Expense Report):
{output}

The input receipt image is attached. Evaluate the report and assign exactly one label:
- "accurate" - Total price, itemized list, and expense category are all accurate. All three must be correct to get this label.
- "almost accurate" - Mostly correct but with small issues. For example, the expense category is too vague.
- "inaccurate" - Substantially wrong or missing information. For example, an incorrect total price.

Respond with only the label."""

experiment_v2, results_v2 = ax_client.experiments.run(
    space=os.environ["SPACE_ID"],
    dataset=DATASET_NAME,
    task=make_judge(JUDGE_PROMPT_V2),
    evaluators=[matches_annotation],
    name="Stronger Prompt Experiment",
)
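
An overall agreement rate can rise while individual examples get worse. As a sketch of a per-example comparison between two prompt versions, the helper below lists which examples the new prompt fixed and which it regressed. compare_runs is a hypothetical helper, and the label lists are invented; in practice they would come from the two experiments' results, aligned by example.

```python
def compare_runs(human, judge_old, judge_new):
    """Return (fixed, regressed) example indices between two judge runs."""
    if not (len(human) == len(judge_old) == len(judge_new)):
        raise ValueError("all label lists must have the same length")
    fixed, regressed = [], []
    for i, (h, old, new) in enumerate(zip(human, judge_old, judge_new)):
        if old != h and new == h:
            fixed.append(i)       # wrong before, right now
        elif old == h and new != h:
            regressed.append(i)   # right before, wrong now
    return fixed, regressed


# Invented labels for illustration only.
human = ["accurate", "almost accurate", "inaccurate", "accurate", "almost accurate"]
v1 = ["accurate", "accurate", "inaccurate", "accurate", "accurate"]
v2 = ["accurate", "almost accurate", "accurate", "accurate", "almost accurate"]

fixed, regressed = compare_runs(human, v1, v2)
print("fixed by new prompt:", fixed)
print("regressed by new prompt:", regressed)
```

Reading the regressed examples alongside the new prompt usually shows which added rule is over-applied.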

The notebook goes one step further with a few-shot prompt (JUDGE_PROMPT_V3) that spells out what makes a category too vague. It reuses the same make_judge / matches_annotation harness with one more prompt.

Results

Each experiment reports the share of examples where the judge’s label matched the human annotation. That agreement rate is how you know whether to trust the judge. Compare the runs in the Experiments tab of your dataset and check whether agreement climbs as the prompt is refined. Once your evaluator reaches a performance level you’re satisfied with, it’s ready for use. The target score will depend on your benchmark dataset and specific use case. You can continue applying the techniques from this tutorial to refine and iterate until the evaluator meets your desired level of quality.
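
Because the benchmark here is a small set of annotated examples, the measured agreement rate carries real uncertainty. One way to keep that in view, offered as a suggestion rather than part of the tutorial's method, is to report a Wilson score interval next to the rate. The sketch below uses only the standard library; wilson_interval is a hypothetical helper and the counts are made up.

```python
import math


def wilson_interval(matches, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= matches <= total:
        raise ValueError("matches must be between 0 and total")
    p = matches / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Made-up example: the judge agreed with the annotator on 17 of 20 receipts.
low, high = wilson_interval(17, 20)
print(f"agreement 17/20 = {17 / 20:.2f}, interval roughly [{low:.2f}, {high:.2f}]")
```

A wide interval suggests annotating more examples before treating a small difference between two prompts as meaningful.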
