The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Claude Code vs Cursor: A Power-User’s Playbook
If you spend your days hopping between Cursor’s VS-Code-style panels and Anthropic’s Claude Code CLI, you likely already know a key fact intuitively: while both promise AI-assisted development, they…

Claude Code Observability and Tracing: Introducing Dev-Agent-Lens
Claude Code is excellent for code generation and analysis. Once it lands in a real workflow, though, you immediately need visibility: Which tools are being called, and how reliably? How…

Annotation for Strong AI Evaluation Pipelines
This post walks through how human annotations fit into your evaluation pipeline in Phoenix, why they matter, and how you can combine them with evaluations to build a strong experimentation…

How Handshake Deployed and Scaled 15+ LLM Use Cases In Under Six Months — With Evals From Day One
Handshake is the largest early-career network, specializing in connecting students and new grads with employers and career centers. It’s also an engineering powerhouse and innovator in applying AI to its…

Evidence-Based Prompting Strategies for LLM-as-a-Judge: Explanations and Chain-of-Thought
When LLMs are used as evaluators, two design choices often determine the quality and usefulness of their judgments: whether to require explanations for decisions, and whether to use explicit chain-of-thought…

Trace-Level LLM Evaluations with Arize AX
Most commonly, we hear about evaluating LLM applications at the span level. This involves checking whether a tool call succeeded, whether an LLM hallucinated, or whether a response matched expectations…

Session-Level Evaluations with Arize AX
When evaluating AI applications, we often look at things like tool calls, parameters, or individual model responses. While this span-level evaluation is useful, it doesn’t always capture the bigger picture…

LLM-as-a-Judge: Example of How To Build a Custom Evaluator Using a Benchmark Dataset
When To Build Custom Evaluators Arize-Phoenix ships with pre-built evaluators that are tested against benchmark datasets and tuned for repeatability. They’re a fast way to stand up rigorous evaluation for…

ADB Database: Realtime Ingestion At Scale
We put out our first blog introducing the Arize database — ADB — at the beginning of July; this blog dives deeper into the realtime ingestion support of…

New In Arize AX: Prompt Learning, Arize Tracing Assistant, and Multiagent Visualization
July was a big month for Arize AX, with updates to make AI and agent engineering much easier. From prompt learning to new skills for Alyx and OpenInference Java, there…