The Evaluator
Your go-to blog for insights on AI observability and evaluation.
How Arize built AI-native support workflows that cut resolution time in half
Arize reduced median support resolution time from 22 hours to roughly 2.5 hours by building AI-native internal workflows for context gathering, debugging, escalation, and continuous improvement.
How to detect credential theft in AI agent harness traces
In May 2026, a malicious version of a popular VS Code extension spent 18 minutes in the marketplace before anyone caught it. In that time it ran on roughly 6,000…
Phoenix at 10,000 stars on GitHub: How an open source AI observability project grew by following its community
Phoenix crossed 10,000 GitHub stars. Here is how the open-source AI observability project grew from a Jupyter notebook extension into a community-shaped platform for traces, evals, OpenInference, and agents.
Building the AI factory for self-improving agents: What’s new in Arize AX
Arize AX is adding managed agents, full-agent experimentation, expanded multimodal support, and Harness-as-a-Judge to help teams observe, evaluate, and improve production agents.
Microsoft’s open trust stack runs on OpenInference
Microsoft’s open trust stack for AI agents puts ASSERT and Agent Control Specification on top of OpenInference, connecting evaluation, runtime controls, and observability through a shared trace contract.
The end of fine-tuning: Why evals, context, and traces matter more
Fine-tuning isn’t dead, but the way most teams iterate on AI products has split in two. A tiny fraction run continuous RL against their own environments; everyone else has moved the iteration loop out of the model and into the harness. Here’s why, and what the 99% should do instead.
AI benchmarks are breaking. Trace analysis is what comes next.
Models got smart enough to cheat their benchmarks, and outcome-only scores stopped measuring what we thought they measured. The fix, full trace analysis, is the same methodology production AI teams have needed all along.
How Hermes implements an open source agent harness architecture
Hermes from NousResearch is a strong open-source agent harness. This post examines how its runtime loop, context management, tool scoping, session infrastructure, and orchestration patterns map to a modern agent harness architecture.
The best eval harness for production AI and agents: A comparison
A practical comparison of production AI evaluation harnesses, including what to look for across instrumentation, evaluators, online evals, CI gates, and agent workflows.
How to build a better agent harness with traces and evals
Agents are easy to prototype and hard to improve. A repeatable loop of traces, evals, failed-span inspection, and targeted harness changes makes agent behavior easier to debug and improve.