The Evaluator
Your go-to blog for insights on AI observability and evaluation.
New In Arize AX: OpenInference TypeScript 2.0, Session Annotations, Integrations Revamp
Arize AX released a flurry of updates in November of 2025. From OpenInference TypeScript 2.0 to a revamp of integrations, there is a lot to catch up on. OpenInference TypeScript…
AWS Bedrock AgentCore Observability with Arize AX: Operationalizing AI Agents At Scale
Building an AI agent in a notebook is straightforward. Getting that agent to run reliably at scale is a different challenge entirely. Most teams hit the same production walls: agents…
Google TUMIX AI Agent Paper, Explained By Its Author
In our latest paper reading, we had the pleasure of featuring Yongchao Chen — a Research Scientist Intern at Google and PhD candidate at MIT and Harvard. He covered his…
CLAUDE.md: Best Practices Learned from Optimizing Claude Code with Prompt Learning
In our last post on Prompt Learning (our prompt optimization feature), we optimized Cline, a powerful coding agent, through its system prompt. This time, we used it on one that…
How To Improve AI Agent Security with Microsoft’s AI Red Teaming Agent in Microsoft Foundry
Building safe AI isn’t optional anymore. Every model deployed to production faces adversarial users trying to make it behave badly. Microsoft Foundry gives you automated red teaming – essentially a…
Evaluating and Improving AI Agents at Scale with Microsoft Foundry
The Case for Continuous AI Quality: As generative and agentic systems mature, the question for enterprises is no longer simply “can we build it?” It is “can we trust it?”…
GEPA vs Prompt Learning: Benchmarking Different Prompt Optimization Approaches
In June 2025, Andrej Karpathy introduced Software 3.0: the notion that software development is shifting from programming through code to prompting through natural language. When building programs, the goal is…
Tracing, Evaluation, and Observability for Google ADK (How To)
Multi-agent systems are moving from research prototypes to production deployments. But there’s a gap between “it works in the demo” and “it works reliably at scale.” Google’s Agent Development Kit…
Top 5 AI Prompt Management Tools of 2025
Every new AI release rises or falls on how people experience it, and prompts play a major role in shaping that experience. A short sentence can write code, trigger tools,…
Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations
In our latest paper reading, we had the pleasure of hosting Grégoire Mialon — Research Scientist at Meta Superintelligence Labs — to walk us through Meta AI’s groundbreaking paper titled…