The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Atropos Health’s Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies
Large language models are increasingly used to turn complex study output into plain-English summaries. But how do we know which models are safest and most reliable for healthcare? In this…

Rise of the Agent Engineer: Trunk Tools’ Bobby Vinson
Trunk Tools is building the brain behind construction, transforming the $13 trillion construction industry. As a premier AI agent platform for the built environment, Trunk Tools deploys solutions that streamline…

adb Benchmarks
In launching adb (Arize database), we wanted to benchmark adb both internally as a database and at the system level in our application. Our goal is to show both the…

Orchestrator-Worker Agents: A Practical Comparison of Common Agent Frameworks
A technical deep dive inspired by Anthropic’s “Building Effective Agents.” In this piece, we’ll take a close look at the orchestrator–worker agent workflow. We’ll unpack its challenges and nuances, then…

Building a Multilingual Cypher Query Evaluation Pipeline
How to evaluate LLM performance across languages for complex Cypher query generation using open-source tools. As organizations expand globally, the need for multilingual AI systems becomes critical. But how…

Verizon’s Stan Miasnikov Walks Through His Latest Paper On Inter-Agent Communication
In a recent Arize community AI research paper reading, we had the honor of hosting Stan Miasnikov – Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon – to highlight the…

New In Arize AX: Experiment Comparisons, Better Data Visualization, and a Dedicated Agent Graph Tab
August was a busy month, with lots of updates from the engineering team to make agent engineering easier. From previewing examples in the UI to a dedicated agent graph tab…

NVIDIA’s Peter Belcak Distills Why Small Language Models are the Future of Agentic AI
In our most recent AI research paper community reading, we had the privilege of hosting Peter Belcak – an AI Researcher working on the reliability and efficiency of agentic systems…

AI Evals Maven Course Homework: the Recipe Bot Workflow
AI Evals for Engineers & PMs is a popular, hands‑on Maven course led by Hamel Husain and Shreya Shankar. The course’s goal is simple: “teach a systematic workflow for evaluating…

Claude Code vs Cursor: A Power-User’s Playbook
If you spend your days hopping between Cursor’s VS Code-style panels and Anthropic’s Claude Code CLI, you likely already know a key fact intuitively: while both promise AI-assisted development, they…