The Evaluator
Your go-to blog for insights on AI observability and evaluation.

LLM Observability for AI Agents and Applications
The era of single-turn LLM calls is behind us. Today’s AI products are powered by increasingly autonomous agents — multi-step systems that plan, reason, use tools, and adapt in real…

Prompt Learning: Using English Feedback to Optimize LLM Systems
Applications of reinforcement learning (RL) in AI model building have been a growing topic over the past few months. From DeepSeek models incorporating RL mechanics into their training processes to…

Self-Adapting Language Models: Paper Authors Discuss Implications
In a recent live AI research paper reading, the authors of the new paper Self-Adapting Language Models (SEAL) shared a behind-the-scenes look at their work, motivations, results, and future directions…

Meet Alyx: Arize’s Evolving AI Agent
We’re excited to introduce Alyx, the next evolution in Arize’s intelligent assistant. You might remember our first iteration — Copilot — launched last year as a set of tools to…

Introducing ADB: Arize’s Proprietary OLAP Database
Earlier this month, we rolled out real‑time ingestion support to every Arize AX workspace—paid and free. With that launch, Arize now ingests terabytes of data every day across hundreds of…

Arize Observe 2025 – Product Releases
Arize Observe 2025 brought a wealth of new product releases, including a redesigned copilot, agent eval options, and state-of-the-art prompt optimization techniques. Check them all out below! Copilot v3: Alyx…

The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning
A recent paper from Apple researchers—The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity—has stirred up significant discussion in the AI…

Introducing GraphQL for Humans – Building a Text-To-GraphQL Agent In a Weekend
Working with GraphQL can often feel overwhelming, especially when you’re navigating massive schemas with tens of thousands of lines. Writing GraphQL queries is often a time-consuming task prone to errors,…

Accurate KV Cache Quantization with Outlier Tokens Tracing
Deploying large language models (LLMs) at scale is expensive—especially during inference. One of the biggest memory and performance bottlenecks? The KV Cache. In a new research paper, Accurate KV Cache…

New in Arize: Realtime Trace Ingestion, Prompt Playground Upgrades & More
In May, we expanded access to realtime trace ingestion across all Arize AX tiers, making it easier than ever to monitor LLM performance live. We also rolled out major usability…