The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Memory is still a missing primitive: Cataloguing what the field is actually shipping
This week the field shipped four kinds of memory, and Apple paid Google a billion dollars a year for one of them. None of the four is what the demos imply. A field map of what’s actually shipping, and the missing primitive that sits between the buckets.
Bring production agent traces from Arize into Databricks Unity Catalog
Arize Data Fabric now supports Databricks, helping teams sync production agent traces, evaluations, and annotations into customer-owned storage for governed analysis in Unity Catalog.
PostgresFS vs. SQL skills: should AI agents fake a filesystem?
Can an AI agent use a database as if it were a filesystem? Arize compared a Postgres-backed filesystem abstraction with a SQL skill and found that locality, accuracy, and maintenance cost favored the skill-based approach.
How Arize built AI-native support workflows that cut resolution time in half
Arize reduced median support resolution time from 22 hours to roughly 2.5 hours by building AI-native internal workflows for context gathering, debugging, escalation, and continuous improvement.
How to detect credential theft in AI agent harness traces
In May 2026, a malicious version of a popular VS Code extension spent 18 minutes in the marketplace before anyone caught it. In that time it ran on roughly 6,000…
Phoenix at 10,000 stars on GitHub: How an open source AI observability project grew by following its community
Phoenix crossed 10,000 GitHub stars. Here is how the open-source AI observability project grew from a Jupyter notebook extension into a community-shaped platform for traces, evals, OpenInference, and agents.
Building the AI factory for self-improving agents: What’s new in Arize AX
Arize AX is adding managed agents, full-agent experimentation, expanded multimodal support, and Harness-as-a-Judge to help teams observe, evaluate, and improve production agents.
Microsoft’s open trust stack runs on OpenInference
Microsoft’s open trust stack for AI agents puts ASSERT and Agent Control Specification on top of OpenInference, connecting evaluation, runtime controls, and observability through a shared trace contract.
The end of fine-tuning: Why evals, context, and traces matter more
Fine-tuning isn’t dead, but the way most teams iterate on AI products has split in two. A tiny fraction run continuous RL against their own environments; everyone else has moved the iteration loop out of the model and into the harness. Here’s why, and what the 99% should do instead.
AI benchmarks are breaking. Trace analysis is what comes next.
Models got smart enough to cheat their benchmarks, and outcome-only scores stopped measuring what we thought they measured. The fix, full trace analysis, is the same methodology production AI teams have needed all along.