Evaluators
What are Evals?
Evals provide a way to measure how well a system performs on a task. When you adjust your prompts, agents, or retrieval, you learn whether performance has improved or not. Instead of relying on intuition or ad-hoc testing, evals turn subjective judgments into measurable results.
This makes it possible to see how an AI system performs over time, identify regressions early, and determine when an application is ready for production. Once in production, evals continue to ensure that the application behaves as intended.
Why Do Evals Matter?
Evals are critical for any system—not just AI applications—as they provide a consistent way to measure performance. With LLMs, evaluation becomes especially important due to the inherently subjective nature of many generative tasks.
Evals help teams quantify any performance criteria, uncover weaknesses, and make confident tradeoffs between speed, cost, and quality. With consistent evals in place, you can track improvements over time, align on what “good” performance looks like, and ensure your AI applications behave as expected in development and production.
Evals can run offline or online: offline evals run against curated datasets during development, while online evals run continuously on live production traffic. Together, they create a comprehensive evaluation framework that balances experimentation and reliability. Offline evals guide model improvements before deployment, while online evals ensure those improvements hold up in real-world conditions.
Why Arize for Evals?
Arize Evals are built on top of our open source Phoenix Evals library. Everything is testable in code, with lightweight, composable building blocks that make it easy to customize and run evaluations programmatically.
Fast: Evals are optimized for speed, achieving up to 20× faster performance through built-in batching and concurrency. It is the fastest evaluation library in the industry, running efficiently at scale without sacrificing accuracy.
Wide variety of prebuilt evaluations: Arize includes a growing set of ready-to-use evals for common LLM evaluation tasks such as hallucination detection, relevance, and faithfulness (see the sketch after this list).
Reusable evaluators: Create evaluators once and reuse them across datasets and projects. The Evaluator Hub makes it easy to explore, manage, and version your evaluators in one place.
Built-in explanations: Every evaluation includes an explanation flag so you can understand why a response passed or failed.
Model agnostic: Evals work with any foundation model you choose, including GPT, Claude, and others.
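As a concrete example, the sketch below runs a prebuilt hallucination eval with explanations using the open source Phoenix Evals library. It is a minimal sketch rather than a full recipe: it assumes the classic phoenix.evals interface (llm_classify, OpenAIModel, and the prebuilt hallucination template), and exact parameter and column names may differ slightly between library versions.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Rows to evaluate; column names are expected to match the template's
# prompt variables (input, reference, output).
df = pd.DataFrame(
    {
        "input": ["Where is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is located in Paris, France."],
        "output": ["The Eiffel Tower is in Berlin."],
    }
)

model = OpenAIModel(model="gpt-4o")  # any supported provider can be swapped in

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,  # prebuilt hallucination eval
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # allowed output labels
    provide_explanation=True,  # return the judge's rationale alongside each label
)

print(results[["label", "explanation"]])
```

Batching and concurrency are handled by the library when the eval runs, which is what makes large evaluation jobs fast in practice.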
What is an Evaluator?
An Evaluator defines how your evaluation runs. It’s the bridge between raw outputs and meaningful insights. Instead of treating results as static text, Evaluators turn them into structured feedback that helps you measure what matters most. By standardizing how feedback is collected, you can compare experiments consistently, track improvements, and align evaluations with your goals.
An Evaluator includes the following fields:
Name is the human-readable identifier for the evaluator or metric. For example, correctness, faithfulness, or toxicity.
Source specifies where the evaluation logic comes from. There are two main evaluation types:
LLM-as-a-Judge: Leverages an LLM to assess outputs based on a structured prompt. This is ideal for subjective or nuanced judgments (ex: helpfulness, tone, coherence)
Code (Heuristic): Uses code and deterministic logic to score outputs (ex: exact match or regex-based checks)
Evaluation Definition defines how the evaluation is performed:
LLM-as-a-Judge: Includes an evaluation template with prompt variables that capture context (ex: question, answer, reference). The LLM uses this prompt to generate a label, score, and explanation.
Code (Heuristic): Includes a code template with data variables passed into the evaluation function. The code executes and returns a structured result (see the sketch below).
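For example, a code (heuristic) evaluator can be a plain function that applies deterministic logic and returns the structured fields described under Output below. The function below is purely illustrative; its name and the dict-shaped result are hypothetical, not a required interface.

```python
import re


def citation_check(output: str) -> dict:
    """Illustrative heuristic evaluator: passes if the output contains
    at least one bracketed citation such as [1] or [12]."""
    has_citation = bool(re.search(r"\[\d+\]", output))
    return {
        "score": 1.0 if has_citation else 0.0,
        "label": "cited" if has_citation else "uncited",
        "explanation": (
            "Found a bracketed citation."
            if has_citation
            else "No bracketed citation like [1] was found."
        ),
    }


# Example usage
print(citation_check("Transformers were introduced in 2017 [1]."))
# {'score': 1.0, 'label': 'cited', 'explanation': 'Found a bracketed citation.'}
```

Because the logic is deterministic, heuristic evaluators are cheap to run at scale and produce perfectly repeatable results, which makes them a good complement to LLM-as-a-Judge evals.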
Optimization Direction indicates whether a higher score is better or worse. This tells the system what “good” means, so metrics can be compared and visualized consistently.
Output captures the evaluation results. It has three core dimensions, which the sketch after this list ties together:
Score: A numeric value that quantifies performance; higher or lower may indicate better results depending on the optimization direction.
Label: A categorical outcome or qualitative tag (ex: incorrect/correct, relevant/irrelevant, helpful/unhelpful).
Explanation: A brief rationale or justification for the result, offering context behind each score or label.
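Putting the fields together, a single evaluator result can be thought of as a small record like the one below. This is an illustrative, hypothetical structure (the class and field names are not part of any Arize API); it simply shows how name, score, label, explanation, and optimization direction fit together.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class EvalResult:
    name: str                            # evaluator name, e.g. "correctness"
    score: Optional[float] = None        # numeric value, read via the optimization direction
    label: Optional[str] = None          # categorical outcome, e.g. "correct" / "incorrect"
    explanation: Optional[str] = None    # rationale behind the score or label
    direction: Literal["maximize", "minimize"] = "maximize"  # is a higher score better?


result = EvalResult(
    name="correctness",
    score=1.0,
    label="correct",
    explanation="The answer matches the reference text.",
)
```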
