The Evaluator
Your go-to blog for insights on AI observability and evaluation.
Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations
In our latest paper reading, we had the pleasure of hosting Grégoire Mialon — Research Scientist at Meta Superintelligence Labs — to walk us through Meta AI’s groundbreaking paper titled…
New In Arize AX: Tags, Data Fabric, Automatic Threshold Ranges for Monitors and More
October of 2025 was a crowded month for shipping new features in Arize AX, with updates to make AI agent engineering easier. From a new timeline tab for traces to…
Hyland’s Approach To AI Agent Engineering
Hyland’s AI agent stack pairs Hyland Agent Builder with agentic document processing to bring context-aware agents to core platforms like OnBase, Alfresco, and Nuxeo, turning document understanding into real…
Building the Data Flywheel for Smarter AI Systems with Arize AX and NVIDIA NeMo
Self-driving cars don’t get better by sitting in a lab. They improve by driving millions of miles, capturing edge cases, and feeding that data back into training. Tesla’s fleet generates…
Top LLM Tracing Tools
As of October 2025, 82% of enterprise leaders rely on generative AI weekly, according to a recent report from Wharton and GBK – with three in four seeing positive…
8 Top Prompt Testing and Optimization Tools for LLMs and Multiagent Systems (2025)
If we were to give the year 2025 an AI-appropriate appellation, it would probably be ‘the year of the agents.’ Building atop the startling advances in generative language and image…
ServiceNow’s Tara Bogavelli on AgentArch: Benchmarking AI Agents for Enterprise Workflows
In our latest AI research paper reading, we hosted Tara Bogavelli, Machine Learning Engineer at ServiceNow, to discuss her team’s recent work on AgentArch, a new benchmark designed to evaluate…
OpenAI’s Santosh Vempala Explains Why Language Models Hallucinate
In our latest AI research paper reading, we hosted Santosh Vempala, Professor at Georgia Tech and co-author of OpenAI’s paper, “Why Language Models Hallucinate.” This paper offers one of the…
What Are the Top LLM Evaluation Tools?
AI agents and real-world applications of generative AI are debuting at an incredible clip this year, narrowing the time from AI research paper to industry application and propelling productivity growth…
Arize AI Achieves ISO/IEC 27001 Certification
Organizations running AI agents in production depend on Arize to operate securely at scale, logging over 1 trillion inferences and spans and 10 million evaluation runs monthly. Today, we’re proud…