The Evaluator
Your go-to blog for insights on AI observability and evaluation.
GEPA vs Prompt Learning: Benchmarking Different Prompt Optimization Approaches
In June 2025, Andrej Karpathy introduced Software 3.0: the notion that software development is shifting from programming through code to prompting through natural language. When building programs, the goal is…
Tracing, Evaluation, and Observability for Google ADK (How To)
Multi-agent systems are moving from research prototypes to production deployments. But there’s a gap between “it works in the demo” and “it works reliably at scale.” Google’s Agent Development Kit…
Top 5 AI Prompt Management Tools of 2025
Every new AI release rises or falls on how people experience it, and prompts play a major role in shaping that experience. A short sentence can write code, trigger tools,…
Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations
In our latest paper reading, we had the pleasure of hosting Grégoire Mialon — Research Scientist at Meta Superintelligence Labs — to walk us through Meta AI’s groundbreaking paper titled…
New In Arize AX: Tags, Data Fabric, Automatic Threshold Ranges for Monitors and More
October of 2025 was a crowded month for shipping new features in Arize AX, with updates to make AI agent engineering easier. From a new timeline tab for traces to…
Hyland’s Approach To AI Agent Engineering
Hyland’s AI agent stack pairs Hyland Agent Builder with agentic document processing to bring context-aware agents to core platforms like OnBase, Alfresco, and Nuxeo, turning document understanding into real…
Building the Data Flywheel for Smarter AI Systems with Arize AX and NVIDIA NeMo
Self-driving cars don’t get better by sitting in a lab. They improve by driving millions of miles, capturing edge cases, and feeding that data back into training. Tesla’s fleet generates…
Top LLM Tracing Tools
As of October 2025, 82% of enterprise leaders rely on generative AI weekly, according to a recent report from Wharton and GBK, with three in four seeing positive…
8 Top Prompt Testing and Optimization Tools for LLMs and Multiagent Systems (2025)
If we were to give the year 2025 an AI-appropriate appellation, it would probably be ‘the year of the agents.’ Building atop the startling advances in generative language and image…
ServiceNow’s Tara Bogavelli on AgentArch: Benchmarking AI Agents for Enterprise Workflows
In our latest AI research paper reading, we hosted Tara Bogavelli, Machine Learning Engineer at ServiceNow, to discuss her team’s recent work on AgentArch, a new benchmark designed to evaluate…