The Evaluator

Your go-to blog for insights on AI observability and evaluation.

Integrating Arize AI and Amazon Bedrock Agents: A Comprehensive Guide to Tracing, Evaluation, and Monitoring
Agents

In today’s rapidly evolving AI landscape, effective observability into agent systems has become a critical requirement for enterprise applications. This technical guide explores the newly announced integration between Arize AI…

LibreEval: A Smarter Way to Detect LLM Hallucinations
Paper Readings, Research

Over the past few weeks, the Arize team has generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We wanted to create a…

Evaluating Large Language Models: Are Modern Benchmarks Sufficient?
Large Language Models, LLM Evals

The accelerated development of GenAI has brought a particular focus on testing and evaluation, leading to the release of several LLM benchmarks. Each of these benchmarks tests the…

Building and Deploying Observable AI Agents with Google Agent Framework and Arize
Agents, AI In the Enterprise, Generative AI

Co-authored by Ali Arsanjani, Director of Applied AI Engineering at Google Cloud. 1. Introduction: The Dawn of the Agentic Era. We have entered a new era of AI innovation…

Arize AI and the Future of Agent Interoperability: Embracing Google’s A2A Protocol
Agents, AI In the Enterprise

We’re excited to announce that Arize AI is joining Google as a launch partner for the Agent2Agent (A2A) protocol, an open standard enabling seamless communication between AI agents…

Tracing and Evaluating Gemini Audio with Arize
Generative AI, LLM Evals

Google’s Gemini models represent a powerful leap forward in multimodal AI, particularly in their ability to process and transcribe audio content with remarkable accuracy. However, even advanced models require robust…

AI Benchmark Deep Dive: Gemini 2.5 and Humanity’s Last Exam
Paper Readings, Research

Our latest paper reading provided a comprehensive overview of modern AI benchmarks, taking a close look at Google’s recent Gemini 2.5 release and its performance on key evaluations, notably the…

Model Context Protocol
Agents, Paper Readings

Want to learn more about Anthropic’s groundbreaking Model Context Protocol (MCP)? We break down how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data…

Self-Improving Agents: Automating LLM Performance Optimization using Arize and NVIDIA NeMo
Agents, Company, Generative AI

Enterprises face a critical challenge in keeping their LLMs accurate and reliable over time. Traditional model improvement approaches are slow, manual, and reactive, making it difficult to scale and…

Prompt Optimization Techniques
Generative AI, Large Language Models, Phoenix

LLMs are powerful tools, but their performance is heavily influenced by how prompts are structured. The difference between an effective and an ineffective prompt can determine whether a model produces accurate…