The Evaluator
Your go-to blog for insights on AI observability and evaluation.
How We Used Evals (and an AI Agent) to Iteratively Improve an AI Newsletter Generator
We love building little AI-powered tools that accelerate our workflows. One we built recently is a tool that takes our recent tweets and uses Claude to create a draft of…
Arize Skills: Coding Agent Workflows for Traces, Evals, and Instrumentation
Two weeks ago we launched Alyx 2.0, the AI engineering agent inside Arize AX. Last week we launched the AX CLI, which made your trace data headless and agent-readable. Today…
How to Build Planning Into Your Agent (The Architecture That Actually Works)
2025 was supposed to be the year of agents. And for the most part, it wasn’t. The industry was full of hype, demos looked incredible, but when you actually tried…
From UI to Terminal: Bringing Alyx’s Superpowers Into Your Coding Agent
Last week we launched Alyx 2.0, the in-app AI engineering agent for Arize AX. Alyx replaced clicking through the UI with natural language intent. The AX CLI takes it a…
How to Evaluate Tool-Calling Agents
When you give an LLM access to tools, you introduce a new surface area for failure — and it breaks in two distinct ways: The model selects the wrong tool…
Best AI Observability Tools for Autonomous Agents in 2026
The shift from simple chat interfaces to autonomous agents has broken the traditional monitoring stack. Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool…
Add Observability to Your Open Agent Spec Agents with Arize Phoenix
Open Agent Specification lets you define an agent once and run it on any compatible runtime: LangGraph, WayFlow, CrewAI, and others. That portability solves a real problem in production AI…
AI Agent Debugging: Four Lessons from Shipping Alyx to Production
Building AI systems that actually work in production is harder than it sounds. Not demo-ware, not “it worked once in a notebook.” Real systems that keep working after week two…
Alyx 2.0: The AI Agent That Actually Plans
Two years ago, we started building Alyx with GPT-3.5, a vision, and honestly, no clear path forward. Agents were a buzzword. The models were rough. Tool calling was just emerging….
Mastering Production RAG with Google ADK and Arize AX for Enterprise Knowledge Systems
Retrieval Augmented Generation (RAG) has become the cornerstone of enterprise AI, yet most organizations struggle with a critical challenge: building RAG systems that work reliably in production. While the…