The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Orchestrator-Worker Agents: A Practical Comparison of Common Agent Frameworks
Technical deep dive inspired by Anthropic’s “Building Effective Agents.” In this piece, we’ll take a close look at the orchestrator–worker agent workflow. We’ll unpack its challenges and nuances, then…

Building a Multilingual Cypher Query Evaluation Pipeline
How to evaluate LLM performance across languages for complex Cypher query generation using open source tools. As organizations expand globally, the need for multilingual AI systems becomes critical. But how…

Verizon’s Stan Miasnikov Walks Through His Latest Paper On Inter-Agent Communication
In a recent Arize community AI research paper reading, we had the honor of hosting Stan Miasnikov – Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon – to highlight the…

New In Arize AX: Experiment Comparisons, Better Data Visualization, and a Dedicated Agent Graph Tab
August was a busy month, with lots of updates from the engineering team to make agent engineering easier. From previewing examples in the UI to a dedicated agent graph tab…

NVIDIA’s Peter Belcak Distills Why Small Language Models are the Future of Agentic AI
In our most recent AI research paper community reading, we had the privilege of hosting Peter Belcak – an AI Researcher working on the reliability and efficiency of agentic systems…

AI Evals Maven Course Homework: the Recipe Bot Workflow
AI Evals for Engineers & PMs is a popular, hands‑on Maven course led by Hamel Husain and Shreya Shankar. The course’s goal is simple: “teach a systematic workflow for evaluating…

Claude Code vs Cursor: A Power-User’s Playbook
If you spend your days hopping between Cursor’s VS-Code-style panels and Anthropic’s Claude Code CLI, you likely already know a key fact intuitively: while both promise AI-assisted development, they…

Claude Code Observability and Tracing: Introducing Dev-Agent-Lens
Claude Code is excellent for code generation and analysis. Once it lands in a real workflow, though, you immediately need visibility: Which tools are being called, and how reliably? How…

Annotation for Strong AI Evaluation Pipelines
This post walks through how human annotations fit into your evaluation pipeline in Phoenix, why they matter, and how you can combine them with evaluations to build a strong experimentation…

How Handshake Deployed and Scaled 15+ LLM Use Cases In Under Six Months — With Evals From Day One
Handshake is the largest early-career network, specializing in connecting students and new grads with employers and career centers. It’s also an engineering powerhouse and innovator in applying AI to its…