The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Long-horizon agent benchmarks are fragmenting: a field guide to what each one actually measures
A field guide to the new wave of long-horizon agent benchmarks: what each one actually measures, the realism-versus-verifiability bargain it strikes, and the seam where its score leaks.
Project Rosetta Stone: a reference implementation for instrumenting agents in any framework
We’ve fielded the same question at every conference this year. An engineer has chosen a framework, CrewAI one week, LangGraph the next, Mastra the week after, and wants to see exactly how observability plugs into the one they picked. OpenInference defines the span vocabulary, the
Why AI token costs don’t tell you if your AI is working
Token spend does not prove AI is creating value. Teams need cost-per-outcome metrics that connect AI usage to resolved tickets, accepted code, shipped features, and other business results.
Sign up for our newsletter, The Evaluator — and stay in the know with updates and new resources:
Meet PXI: the AI engineering agent inside Phoenix
An AI engineering agent built into Phoenix. It works like a coding agent, just point it at your telemetry instead of a source tree.
What is an agent harness? Why harnesses are replacing agent frameworks
Agent harnesses are replacing frameworks as the real product surface for reliable AI agents, shifting the work from prompt tuning to loops, tools, traces, evals, and operational metrics.
Two labs started dreaming, and they built two different architectures
Anthropic and OpenAI both shipped ‘dreaming’ for AI memory in May and June 2026, and they built opposite architectures. A look at what each lab shipped, what the empirical literature says, and what to do if you are building memory for your own agent.
What is agent orchestration? Frameworks, runtimes, and observability explained
Agent orchestration is not one problem. It spans expression, runtime, and observability, and separating those layers clarifies how teams should build, run, and improve production agents.
One agent, two trace destinations: Arize AX + Databricks Unity Catalog
Send one OpenTelemetry trace stream to both Arize AX and Databricks Unity Catalog so engineers can debug agents in Arize while data teams analyze the same spans in governed lakehouse storage.
Memory is still a missing primitive: Cataloguing what the field is actually shipping
This week the field shipped four kinds of memory, and Apple paid Google a billion dollars a year for one of them. None of the four is what the demos imply. A field map of what’s actually shipping, and the missing primitive that sits between the buckets.
Bring production agent traces from Arize into Databricks Unity Catalog
Arize Data Fabric now supports Databricks, helping teams sync production agent traces, evaluations, and annotations into customer-owned storage for governed analysis in Unity Catalog.