The Evaluator
Your go-to blog for insights on AI observability and evaluation.
CUGA Agent: From Benchmarks to Business Impact of IBM’s Generalist Agent
This paper reading features several of the researchers — including Segev Shlomov (PhD), Ido Levy, Asaf Adi, and Avi Yaeli — behind the widely acclaimed paper “From Benchmarks to Business…
Top Generative AI Conferences In 2026 for Engineers
GenAI stacks are shifting fast enough that staying current is an ongoing project, not a quarterly refresh. The hard part is separating durable engineering practices (evals, reliability, cost controls, security)…
New In Arize AX: January 2026 Updates
Arize AX shipped a number of updates in January 2026. From an improved evaluator hub to custom prompt release labels, here are some highlights. Evaluator Hub: Reusable Evaluators. We’re…
How Nebulock Democratizes Threat Hunting
Nebulock is on a mission to democratize threat hunting. Instead of relying only on deterministic rules or reacting to alerts as they come in, the team builds AI agents that…
Why AI Agents Break: A Field Analysis of Production Failures
As AI agents enter production environments, they face conditions their training does not cover. These systems generate fluent output, yet operational work demands exact action. Small ambiguities compound fast when…
OWASP Top 10 for Agentic Applications: Compliance Guide
This guide maps the OWASP Agentic Security Initiative (ASI) top ten risks to specific Arize AX observability features and metrics you should implement to detect, monitor, and mitigate threats in…
Hierarchical Memory Management In Agent Harnesses
We’ve worked with thousands of customers building AI agents, and we’ve also spent the last two years building our own agent, Alyx, an in-product assistant for Arize AX. These experiences…
How Observability-Driven Sandboxing Secures AI Agents
AI agents become dangerous at the moment they gain the ability to execute actions. The moment an agent can touch the file system or invoke external tools, safety shifts from…
AI Agent Interfaces in 2026: Filesystem vs API vs Database (What Actually Works)
We Don’t Know How to Build Agent Interfaces Yet (And That’s Fine). Letta just published benchmark results showing a filesystem-based agent scored 74% on memory tasks by simply storing conversation…
Google Antigravity and Arize AX’s MCP Tracing Assistant: How to Trace Your Agent Without Writing Any Code
TL;DR: Add the Arize AX MCP server to Antigravity to instrument your AI applications without leaving your IDE. Instrumenting AI applications with tracing and observability is critical for debugging, monitoring,…