The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Evidence-Based Prompting Strategies for LLM-as-a-Judge: Explanations and Chain-of-Thought
When LLMs are used as evaluators, two design choices often determine the quality and usefulness of their judgments: whether to require explanations for decisions, and whether to use explicit chain-of-thought…

Guide to Trace-Level LLM Evaluations with Arize AX
Most commonly, we hear about evaluating LLM applications at the span level. This involves checking whether a tool call succeeded, whether an LLM hallucinated, or whether a response matched expectations…

Session-Level Evaluations with Arize AX
When evaluating AI applications, we often look at things like tool calls, parameters, or individual model responses. While this span-level evaluation is useful, it doesn’t always capture the bigger picture…

LLM-as-a-Judge: Example of How To Build a Custom Evaluator Using a Benchmark Dataset
When To Build Custom Evaluators: Arize-Phoenix ships with pre-built evaluators that are tested against benchmark datasets and tuned for repeatability. They’re a fast way to stand up rigorous evaluation for…

ADB Database: Realtime Ingestion At Scale
We put out our first blog introducing the Arize database – ADB – at the beginning of July; this blog dives deeper into the realtime ingestion support of…

New In Arize AX: Prompt Learning, Arize Tracing Assistant, and Multiagent Visualization
July was a big month for Arize AX, with updates to make AI and agent engineering much easier. From prompt learning to new skills for Alyx and OpenInference Java, there…

A Watermark for Large Language Models
In our latest live community reading of AI research papers, the primary author of the popular paper A Watermark For Large Language Models (John Kirchenbauer of the University of Maryland) walked us…

Unlocking Safer AI: Your Two-Part Field Guide
Large language models are reshaping how we build products — and how adversaries try to break them. To help teams stay ahead, Sofia Jakovcevic — AI Solutions Engineer at Arize…

LLM Observability for AI Agents and Applications
The era of single-turn LLM calls is behind us. Today’s AI products are powered by increasingly autonomous agents — multi-step systems that plan, reason, use tools, and adapt in real…

Prompt Learning: Using English Feedback to Optimize LLM Systems
Applications of reinforcement learning (RL) in AI model building have been a growing topic over the past few months. From DeepSeek models incorporating RL mechanics into their training processes to…