The Evaluator
Your go-to blog for insights on AI observability and evaluation.
 
        ServiceNow’s Tara Bogavelli on AgentArch: Benchmarking AI Agents for Enterprise Workflows
In our latest AI research paper reading, we hosted Tara Bogavelli, Machine Learning Engineer at ServiceNow, to discuss her team’s recent work on AgentArch, a new benchmark designed to evaluate…
 
        OpenAI’s Santosh Vempala Explains Why Language Models Hallucinate
In our latest AI research paper reading, we hosted Santosh Vempala, Professor at Georgia Tech and co-author of OpenAI’s paper, “Why Language Models Hallucinate.” This paper offers one of the…
 
        Building the Data Flywheel for Smarter AI Systems with Arize AX and NVIDIA NeMo
Self-driving cars don’t get better by sitting in a lab. They improve by driving millions of miles, capturing edge cases, and feeding that data back into training. Tesla’s fleet generates…
 
        What Are the Top LLM Evaluation Tools?
AI agents and real-world applications of generative AI are debuting at an incredible clip this year, narrowing the time from AI research paper to industry application and propelling productivity growth…
 
        Arize AI Achieves ISO/IEC 27001 Certification
Organizations running AI agents in production depend on Arize to operate securely at scale, logging over 1 trillion inferences and spans and 10 million evaluation runs monthly. Today, we’re proud…
 
        Optimizing Coding Agent Rules (./clinerules) for Improved Accuracy
Coding agents have become the focal point of modern software development. Tools like Cursor, Claude Code, Codex, Cline, Windsurf, Devin, and many more are revolutionizing how engineers write and ship…
 
        Keller Williams: Rise of the Agent Engineer
Austin, Texas-based Keller Williams Realty, LLC is the world’s largest real estate franchise by agent count. It has more than 1,000 market center offices and 161,000 affiliated agents. The franchise…
 
        Should I Use the Same LLM for My Eval as My Agent? Testing Self-Evaluation Bias
Thanks to Aparna Dhinakaran and Elizabeth Hutton for their contributions to this piece. When building and testing AI agents, one practical question that arises is whether to use the same…
 
        New In Arize AX: Session and Trace Evals, Alyx’s Synthetic Data Generation, and more
September was a busy month product-wise for Arize AX, with updates to make AI agent engineering faster and easier. From session and trace evals to Alyx’s new synthetic data generation…
 
        Testing Binary vs Score Evals on the Latest Models
Thanks to Hamel Husain and Eugene Yan for reviewing this piece. Evals are becoming the predominant approach for how AI engineers systematically evaluate the quality of LLM-generated outputs…
 