The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Best AI Observability Tools for Autonomous Agents in 2026
The shift from simple chat interfaces to autonomous agents has broken the traditional monitoring stack. Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool…
Add Observability to Your Open Agent Spec Agents with Arize Phoenix
Open Agent Specification lets you define an agent once and run it on any compatible runtime: LangGraph, WayFlow, CrewAI, and others. That portability solves a real problem in production AI…
AI Agent Debugging: Four Lessons from Shipping Alyx to Production
Building AI systems that actually work in production is harder than it sounds. Not demo-ware, not “it worked once in a notebook.” Real systems that keep working after week two….
Alyx 2.0: The AI Agent That Actually Plans
Two years ago, we started building Alyx with GPT-3.5, a vision, and honestly, no clear path forward. Agents were a buzzword. The models were rough. Tool calling was just emerging….
Mastering Production RAG with Google ADK and Arize AX for Enterprise Knowledge Systems
Retrieval Augmented Generation (RAG) has become the cornerstone of enterprise AI, yet most organizations struggle with a critical challenge: building RAG systems that work reliably in production. While the…
How America First Credit Union Built a GenAI “Decision Explainer” — With Tracing That Scales
America First Credit Union is one of America’s largest independent credit unions, with 1.5 million members and more than $20 billion worth of deposits. As America First Credit Union scaled…
Closing the Loop: Coding Agents, Telemetry, and the Path to Self-Improving Software
2025 marked the widespread adoption of coding agents — harnesses that autonomously write, test, and debug changes to software with minimal human intervention. Products like Claude Code, Codex, Cursor, and…
Inside Typeform’s AI Agent Stack
Typeform is building generative AI experiences to help customers create better forms faster and to make collecting insights feel more natural and useful end-to-end. In this Q&A, Marta Lorens, Senior…
CUGA Agent: From Benchmarks to Business Impact of IBM’s Generalist Agent
This paper reading features several of the researchers — including Segev Shlomov (PhD), Ido Levy, Asaf Adi, and Avi Yaeli — behind the widely acclaimed paper “From Benchmarks to Business…
Top Generative AI Conferences In 2026 for Engineers
GenAI stacks are shifting fast enough that staying current is an ongoing project, not a quarterly refresh. The hard part is separating durable engineering practices (evals, reliability, cost controls, security)…