The Evaluator
Your go-to blog for insights on AI observability and evaluation.
 
        Building the Data Flywheel for Smarter AI Systems with Arize AX and NVIDIA NeMo
Self-driving cars don’t get better by sitting in a lab. They improve by driving millions of miles, capturing edge cases, and feeding that data back into training. Tesla’s fleet generates…
 
        Top LLM Tracing Tools
As of October 2025, 82% of enterprise leaders now rely on generative AI weekly, according to a recent report from Wharton and GBK, with three in four seeing positive…
 
        8 Top Prompt Testing and Optimization Tools for LLMs and Multiagent Systems (2025)
If we were to give the year 2025 an AI-appropriate appellation, it would probably be ‘the year of the agents.’ Building atop the startling advances in generative language and image…
 
        ServiceNow’s Tara Bogavelli on AgentArch: Benchmarking AI Agents for Enterprise Workflows
In our latest AI research paper reading, we hosted Tara Bogavelli, Machine Learning Engineer at ServiceNow, to discuss her team’s recent work on AgentArch, a new benchmark designed to evaluate…
 
        OpenAI’s Santosh Vempala Explains Why Language Models Hallucinate
In our latest AI research paper reading, we hosted Santosh Vempala, Professor at Georgia Tech and co-author of OpenAI’s paper, “Why Language Models Hallucinate.” This paper offers one of the…
 
        What Are the Top LLM Evaluation Tools?
AI agents and real-world applications of generative AI are debuting at an incredible clip this year, narrowing the time from AI research paper to industry application and propelling productivity growth…
 
        Arize AI Achieves ISO/IEC 27001 Certification
Organizations running AI agents in production depend on Arize to operate securely at scale, logging over 1 trillion inferences and spans and 10 million evaluation runs monthly. Today, we’re proud…
 
        Optimizing Coding Agent Rules (./clinerules) for Improved Accuracy
Coding agents have become the focal point of modern software development. Tools like Cursor, Claude Code, Codex, Cline, Windsurf, Devin, and many more are revolutionizing how engineers write and ship…
 
        Keller Williams: Rise of the Agent Engineer
Austin, Texas-based Keller Williams Realty, LLC is the world’s largest real estate franchise by agent count. It has more than 1,000 market center offices and 161,000 affiliated agents. The franchise…
 
        Should I Use the Same LLM for My Eval as My Agent? Testing Self-Evaluation Bias
Thanks to Aparna Dhinakaran and Elizabeth Hutton for their contributions to this piece. When building and testing AI agents, one practical question that arises is whether to use the same…
 