
The Definitive Guide to LLM App Evaluation

Master the Art of Evaluating Large Language Model Applications Across Their Lifecycle

In the fast-moving world of AI, robust evaluation is key to building reliable, high-performing large language model (LLM) applications. “The Definitive Guide to LLM App Evaluation” walks you through the evaluation strategies that keep your applications performing well from inception through deployment and beyond.

What You’ll Learn:

The Fundamentals of LLM Evaluation: Gain a clear understanding of evaluation’s role in measuring and improving LLM performance, from offline testing to real-time monitoring in production.

Types of Evaluation: Dive into diverse approaches, including code-based evaluations, LLM-as-a-judge techniques, and task-specific evaluation frameworks tailored to your application’s needs (a brief LLM-as-a-judge sketch follows this list).

Online vs. Offline Evaluation: Learn when to use pre-deployment testing versus real-time evaluation to ensure consistent performance and capture dynamic, real-world feedback.

Benchmarking and Continuous Improvement: Discover how to create benchmarks, optimize evaluation models, and implement self-improving evaluations that adapt to evolving challenges.

Experimentation and CI/CD Integration: Explore how to design experiments, analyze results, and integrate evaluations into CI/CD pipelines for seamless application updates.

Best Practices for Dataset Creation: Master techniques for curating golden datasets, using synthetic data, and incorporating human feedback to enhance evaluation accuracy and reliability.

Equip yourself with the tools and knowledge to confidently evaluate and optimize your LLM applications at every stage of their lifecycle.
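To make the LLM-as-a-judge idea from the list above concrete, here is a minimal sketch. It assumes a hypothetical call_llm helper that wraps whichever model provider you use; the prompt wording, the judge function, and the evaluate aggregator are illustrative, not the book's prescribed implementation.

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical helper that
# wraps whatever model provider you use; swap in your own client.

JUDGE_PROMPT = """You are grading an answer produced by an AI assistant.

Question: {question}
Answer: {answer}

Reply with a single word: "correct" if the answer is factually accurate
and addresses the question, otherwise "incorrect"."""


def call_llm(prompt: str) -> str:
    """Hypothetical provider wrapper; replace with your LLM client call."""
    raise NotImplementedError("plug in your model provider here")


def judge(question: str, answer: str) -> bool:
    """Ask a judge model to label one (question, answer) pair."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().lower().startswith("correct")


def evaluate(examples: list[dict]) -> float:
    """Return the fraction of examples the judge labels correct."""
    passed = sum(judge(ex["question"], ex["answer"]) for ex in examples)
    return passed / len(examples) if examples else 0.0
```

The same pass-rate number can feed a benchmark dashboard or gate a CI/CD pipeline, which is where the experimentation and continuous-improvement chapters of the guide pick up.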

Read the eBook
