The Evaluator
Your go-to blog for insights on AI observability and evaluation.

Sleep-time Compute: Beyond Inference Scaling at Test-time
We recently discussed “Sleep-time Compute: Beyond Inference Scaling at Test-time,” new research from the team at Letta. The paper addresses a key challenge in using powerful AI models:…

New in Arize: Bigger Datasets, Better Evaluations, and Expanded CV Support
April was a big month for Arize, with updates designed to make building, evaluating, and managing your models and prompts even easier. From larger dataset runs in Prompt Playground to…

Integrating Arize AI and Amazon Bedrock Agents: A Comprehensive Guide to Tracing, Evaluation, and Monitoring
In today’s rapidly evolving AI landscape, effective observability into agent systems has become a critical requirement for enterprise applications. This technical guide explores the newly announced integration between Arize AI…

LibreEval: A Smarter Way to Detect LLM Hallucinations
Over the past few weeks, the Arize team has generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We wanted to create a…

40 Large Language Model Benchmarks and The Future of Model Evaluation
The accelerated development of GenAI has brought particular focus to testing and evaluation, resulting in the release of several LLM benchmarks. Each of these benchmarks tests the…

Building and Deploying Observable AI Agents with Google Agent Framework and Arize
Co-authored by Ali Arsanjani, Director of Applied AI Engineering at Google Cloud. We have entered a new era of AI innovation…

Embracing Google’s Agent-To-Agent (A2A) Protocol
We’re excited to announce that Arize AI is partnering with Google as a launch partner for the Agent-to-Agent (A2A) Protocol, an open standard enabling seamless communication between AI agents…

Tracing and Evaluating Gemini Audio with Arize
Google’s Gemini models represent a powerful leap forward in multimodal AI, particularly in their ability to process and transcribe audio content with remarkable accuracy. However, even advanced models require robust…

AI Benchmark Deep Dive: Gemini 2.5 and Humanity’s Last Exam
Our latest paper reading provided a comprehensive overview of modern AI benchmarks, taking a close look at Google’s recent Gemini 2.5 release and its performance on key evaluations, notably the…

Model Context Protocol (MCP) from Anthropic
Want to learn more about Anthropic’s groundbreaking Model Context Protocol (MCP)? We break down how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data…