The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Arize Skills: Coding Agent Workflows for Traces, Evals, and Instrumentation
Two weeks ago we launched Alyx 2.0, the AI engineering agent inside Arize AX. Last week we launched the AX CLI, which made your trace data headless and agent-readable. Today…
How to Build Planning Into Your Agent (The Architecture That Actually Works)
2025 was supposed to be the year of agents. And for the most part, it wasn’t. The industry was full of hype, demos looked incredible, but when you actually tried…
From UI to Terminal: Bringing Alyx’s Superpowers Into Your Coding Agent
Last week we launched Alyx 2.0, the in-app AI engineering agent for Arize AX. Alyx replaced clicking through the UI with natural language intent. The AX CLI takes it a…
How to Evaluate Tool-Calling Agents
When you give an LLM access to tools, you introduce a new surface area for failure — and it breaks in two distinct ways: The model selects the wrong tool…
Best AI Observability Tools for Autonomous Agents in 2026
The shift from simple chat interfaces to autonomous agents has broken the traditional monitoring stack. Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool…
Add Observability to Your Open Agent Spec Agents with Arize Phoenix
Open Agent Specification lets you define an agent once and run it on any compatible runtime: LangGraph, WayFlow, CrewAI, and others. That portability solves a real problem in production AI…
AI Agent Debugging: Four Lessons from Shipping Alyx to Production
Building AI systems that actually work in production is harder than it sounds. Not demo-ware, not “it worked once in a notebook.” Real systems that keep working after week two….
Alyx 2.0: The AI Agent That Actually Plans
Two years ago, we started building Alyx with GPT-3.5, a vision, and honestly, no clear path forward. Agents were a buzzword. The models were rough. Tool calling was just emerging….
Mastering Production RAG with Google ADK and Arize AX for Enterprise Knowledge Systems
Retrieval Augmented Generation (RAG) has become the cornerstone of enterprise AI, yet most organizations struggle with a critical challenge: building RAG systems that work reliably in production. While the…
How America First Credit Union Built a GenAI “Decision Explainer” — With Tracing That Scales
America First Credit Union is one of America’s largest independent credit unions, with 1.5 million members and more than $20 billion worth of deposits. As America First Credit Union scaled…