What is the Evaluator Hub?

The Evaluator Hub is your centralized place for managing evaluators. Instead of rebuilding evaluation logic every time you start a new task, you can define an evaluator once in the Hub and reuse it across projects, datasets, and workflows.

Why Use the Eval Hub?

Without a centralized evaluator library, teams tend to recreate the same evaluation logic across tasks, lose track of what changed, and end up with inconsistent quality criteria across projects. The Eval Hub solves this by giving you a single source of truth for all your evaluators.
  • Reusable across tasks and projects. Create an evaluator once and attach it to any evaluation task — online monitoring, offline batch runs, or dataset experiments. No need to rewrite prompts, reconfigure models, or duplicate code logic.
  • Full version history. Every change to an evaluator is tracked with a commit message. You can see what changed, when, and why — making it easy to audit evaluation criteria over time.
  • Consistent quality standards. When the same evaluator is used across projects, your team applies the same definition of “good” everywhere. This eliminates drift between how different tasks measure performance.
  • Flexible column mappings. Template variables (for LLM-as-a-Judge) and data variables (for code evaluators) map to your datasource columns at the task level, so a single evaluator works across datasets and projects with different schemas.

Navigating the Evaluators Page

The Evaluators page in Arize AX contains two tabs:

Evaluator Hub Tab

This is where evaluators are defined, configured, and managed. Each evaluator card shows its name, type, model configuration, version count, and update history. From here you can:
  • Browse all available evaluators
  • Create new evaluators
  • Edit and version existing evaluators
  • Launch a new task directly from an evaluator

Running Tasks Tab

This is where evaluation tasks execute evaluators against your data. A task connects an evaluator to a data source (project traces or dataset) and runs it on a schedule or as a one-time batch. See Online Evals for more on creating and managing tasks.

Creating an Evaluator in the Eval Hub

Navigate to Evaluators in the left sidebar, then click New Evaluator in the upper right.
The Eval Hub currently supports LLM-as-a-Judge evaluators. Reusable Code evaluators are coming soon.

LLM-as-a-Judge Evaluators

LLM-as-a-Judge evaluators use an LLM to assess outputs based on a structured prompt. There are three ways to create one:

Option A: Use a Pre-Built Template

Arize provides pre-built evaluation templates tested against benchmarked datasets. These cover common evaluation scenarios so you can get started quickly.
  1. Click New Evaluator
  2. Select a template from the list (e.g., Hallucination, Relevance, Toxicity, User Frustration)
  3. Give your evaluator a name — this is how it appears in the Hub and in your results
  4. Configure the LLM settings: select a provider, model, and parameters
  5. Click Save to add it to the Eval Hub
Available pre-built templates include:
  • Hallucination: outputs containing information not supported by the reference
  • Relevance: whether responses address the input question
  • Toxicity: harmful or inappropriate content
  • Helpfulness: how useful the response is to the user
  • Q&A Correctness: answer accuracy given reference documents
  • Summarization: whether summaries capture the source material
  • User Frustration: signs of frustration in conversations
  • Code Generation: code correctness and readability
  • SQL Generation: SQL query correctness
  • Tool Calling: function call accuracy and parameter extraction

Option B: Create from Blank

Build a custom evaluator when pre-built templates don’t capture your application-specific criteria.
  1. Click New Evaluator, then select Create From Blank
  2. Name the evaluator descriptively (e.g., “Travel Plan Completeness”, “Regulatory Compliance Check”)
  3. Write your prompt template — describe the judge’s role, evaluation criteria, and include template variables (e.g., {input}, {output}, {context}) that will be populated with your data
  4. Define output labels — set the possible values the judge can return (e.g., correct/incorrect, or a 1–5 scale) along with their scores
  5. Configure the judge model — select the AI provider, model, and parameters
  6. Toggle Explanations to “On” if you want the judge to provide a rationale for each label
  7. Click Save
Categorical labels (e.g., correct/incorrect) tend to be more reliable and consistent than numeric scores for most evaluation tasks.
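To make the steps above concrete, here is what a judge prompt template with categorical labels might look like. This is an illustrative sketch, not a fixed schema: the variable names, labels, and wording are example choices, and in a real task the template variables are populated through column mappings rather than by calling `.format()` yourself.

```python
# Illustrative prompt template for a custom LLM-as-a-Judge evaluator.
# The variables ({input}, {output}, {context}), labels, and criteria below
# are example choices; a task's column mappings would fill the variables.
TEMPLATE = """You are judging whether a travel itinerary is complete.

[Question]: {input}
[Itinerary]: {output}
[Reference notes]: {context}

A complete itinerary covers transport, lodging, and a daily schedule.
Respond with exactly one label: "complete" or "incomplete".
"""

# Output labels with scores, as in step 4 above.
LABELS = {"complete": 1, "incomplete": 0}

# Simulate what the task does at runtime: substitute data into the template.
rendered = TEMPLATE.format(
    input="Plan a 3-day trip to Kyoto",
    output="Day 1: Fushimi Inari; Day 2: Arashiyama; Day 3: Gion.",
    context="Traveler prefers trains over flights.",
)
```

Keeping the criteria and the allowed labels explicit in the template is what makes the judge's responses easy to parse into the scores you defined.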

Option C: Use Alyx to Generate an Evaluator

Alyx can generate custom evaluators from plain language descriptions.
  1. Click the Alyx icon in the upper right corner
  2. Describe what you want to evaluate in plain language, for example:
    Write a custom evaluation that checks if customer support responses
    are empathetic, address the customer's concern, and provide actionable
    next steps. Score from 1-5.
    
  3. Review the generated evaluator — adjust the template, labels, or model as needed
  4. Save to the Eval Hub

Code Evaluators (⚠️ Coming Soon)

Code evaluators use deterministic logic — Python code — to score outputs. They’re ideal for objective checks like regex matching, JSON validation, keyword presence, or any custom heuristic.
Reusable Code Evaluators will be available in the Evaluator Hub soon. In the meantime, you can add code evaluators directly to your task.
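As a sketch of the kind of deterministic check a code evaluator performs, here is a standalone Python function combining JSON validation with a regex match. The function name and the label/score return shape are illustrative assumptions, not the task-level interface:

```python
import json
import re

def evaluate_output(output: str) -> dict:
    """Example deterministic evaluator: the output must be valid JSON and
    contain a 'price' field formatted like a dollar amount.
    The {label, score} return shape is illustrative, not a fixed contract.
    """
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return {"label": "invalid_json", "score": 0}
    price = str(payload.get("price", ""))
    # Accept e.g. "$19" or "$19.99"
    if re.fullmatch(r"\$\d+(\.\d{2})?", price):
        return {"label": "valid", "score": 1}
    return {"label": "missing_price", "score": 0}
```

Because checks like this are pure functions of the output, they run cheaply and give the same answer every time, unlike an LLM judge.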

Versioning Evaluators

As your understanding of “good” evolves, your evaluators should too. The Eval Hub tracks every change to an evaluator with a version history.

How Versioning Works

  • Each time you edit and save an evaluator, a new version is created
  • You’re prompted to add a commit message describing what changed (e.g., “Tightened criteria for budget accuracy”, “Added edge case for multi-city trips”)
  • The full version history is visible on the evaluator detail page

Best Practices for Versioning

  • Write descriptive commit messages. Future you (and your teammates) will thank you when reviewing why evaluation criteria shifted.
  • Version after testing. Use the Evaluator Playground to test changes before committing a new version.
  • Review version history before modifying. Check what the evaluator currently does and why recent changes were made before introducing new edits.

Reusing an Evaluator Across Tasks

The core value of the Eval Hub is reuse. Once an evaluator is saved, you can attach it to any evaluation task without recreating it.

Attaching an Evaluator to a Task

There are two ways to use an existing evaluator.

From the task creation flow:
  1. Click New Task on the Evaluators page
  2. Select the evaluator type (LLM-as-a-Judge or Code Evaluator)
  3. Click Add Evaluator, then choose your evaluator from the Eval Hub
From the Eval Hub directly:
  1. Navigate to the Eval Hub tab
  2. Find the evaluator you want to use
  3. Click Use Evaluator — this opens the task creation flow with that evaluator pre-selected

Configuring Column Mappings

When you attach an evaluator to a task, you may need to map its variables to your datasource columns. This is what makes evaluators truly portable — the same evaluator can work with different data schemas.
  1. After adding the evaluator to a task, the column mappings panel shows all variables — prompt template variables for LLM-as-a-Judge evaluators, or data variables for code evaluators
  2. For each variable (e.g., {input}, {output}, {context}), select the corresponding column from your datasource
  3. If the variable names match your datasource columns, mappings are configured automatically
Column mappings are configured at the task level, not the evaluator level. This means a single evaluator can be mapped differently for different projects or datasets.
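The idea can be sketched in a few lines of Python. This is a hypothetical illustration of the concept, not Arize's implementation: the same evaluator variables are bound to different datasource columns per task, so one evaluator serves both schemas.

```python
# Hypothetical illustration of task-level column mappings.
# One evaluator declares its variables once:
EVALUATOR_VARIABLES = ["input", "output", "context"]

# Two tasks bind those variables to different datasource schemas:
task_a_mapping = {"input": "question", "output": "response", "context": "docs"}
task_b_mapping = {"input": "user_msg", "output": "bot_reply", "context": "retrieved"}

def resolve(row: dict, mapping: dict) -> dict:
    """Pull the evaluator's variables out of a datasource row via the mapping."""
    return {var: row[col] for var, col in mapping.items()}

# A row from task B's datasource, with its own column names:
row_b = {"user_msg": "Hi", "bot_reply": "Hello!", "retrieved": "greeting docs"}
resolved = resolve(row_b, task_b_mapping)
# The evaluator always sees {"input": ..., "output": ..., "context": ...},
# regardless of the underlying column names.
```

Because the mapping lives on the task rather than the evaluator, editing the evaluator never breaks the bindings of other tasks that use it.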

Example Workflow: End-to-End

Here’s how a typical workflow looks using the Eval Hub:

Step 1: Create an Evaluator in the Hub

Navigate to Evaluators > New Evaluator. Choose an LLM-as-a-Judge evaluator (pre-built template, created from blank, or Alyx-generated). Configure the settings, define your labels, and save it.

Step 2: Test it in the Playground

Before running at scale, test your evaluator against sample data. For LLM-as-a-Judge evaluators, use the Playground to refine the prompt template until you’re confident in the results.

Step 3: Attach it to a Task

Create a task that runs the evaluator continuously on existing or production traces. Configure filters, sampling rate, and column mappings. See Setting Up Online Evals for the full setup guide.

Step 4: Reuse it on a different project

When you start a new project with similar quality criteria, go back to the Eval Hub and attach the same evaluator to a new task — just update the column mappings to match the new data schema.

Step 5: Version it as criteria evolve

As you learn more about what “good” looks like for your application, edit the evaluator and commit a new version with a descriptive message. All tasks using this evaluator will pick up the latest version.