Latest Resources

Practical guides, field notes & frameworks for reliable AI agents.

Code-first for engineers. Quality frameworks for product managers. Operating models for leaders.

Guide

Agent harnesses: How to trace, evaluate, and improve AI agents

Learn how agent harnesses use tracing and evaluations to make AI agents observable, testable, safer, and easier to improve in production.

Aaron Winston 1 min read Jul 2026

Read the story →

Guide

The definitive guide to LLM evaluations

LLM evaluation: Get from pre-production to deployment with our definitive guide to LLM evaluation. Includes LLM eval types, use cases, templates and…

1 min read

Guide

AI agent evaluation: An agent-native framework

Learn how to evaluate AI agents across outcomes, trajectories, decisions, and repeated-run reliability using traces, checks, and LLM judges.

1 min read

Guide

Why do you need evals for AI agents?

Agent failures hide inside correct-looking outputs. Learn why evals are the only mechanism that catches them, and how to wire them into…

1 min read

From the Blog

Latest field notes & frameworks.

Explore More →

Post

How to write effective AI agent skills: 6 data-backed practices

Three recent studies show what actually makes an AI agent skill effective: human expertise, compact procedures, tight routing, harness-specific testing, and eval-gated…

11 min read

Post

Cost per successful task: Benchmarking Kimi K3, GPT-5.5, and 8 more AI models

Arize and Fireworks benchmarked 10 AI models across 2,400 agent runs. Learn why cost per successful task beats token price for model…

16 min read

Post

How to measure human-LLM judge alignment

No single metric proves an LLM judge is trustworthy. This field guide shows how to measure human–human agreement, compare it to LLM–human…

15 min read

Customer Lessons

Real teams, shipping AI.

All customer stories →

Blog

How TheFork Leverages Online Evals To Boost Conversions with Arize AX on AWS

TheFork is one of Europe’s leading restaurant discovery and booking platforms, connecting millions of diners with tens of thousands of restaurants across…

4 min read

Events

Agents in the Wild: Priceline’s Journey into Evaluating Voice Applications

Join us for an inside look at how Priceline is evolving its AI-powered travel assistant, Penny, from text-based interactions to voice-enabled experiences. We’ll…

1 min read

Blog

How Handshake Deployed and Scaled 15+ LLM Use Cases In Under Six Months — With Evals From Day One

Handshake is the largest early-career network, specializing in connecting students and new grads with employers and career centers. It’s also an engineering…

4 min read

Videos & Talks

Demos, workshops & conference talks.

Watch on YouTube →

Rise of the AI Engineer

An agent got the right answer the wrong way | Michael Grinich, WorkOS

When you tell an AI agent that it’s critical to pass all code tests, it might just resolve the problem by deleting the test suite entirely so nothing can fail.

Don't ship vibes.

Trace, evaluate, and continuously improve your agents — built on open source & open standards.

Get started Self-host now

Docs

Start here

Build your playbook

See it in production

Company

Docs

Start here

Build your playbook

See it in production

Company

Practical guides, field notes & frameworks for reliable AI agents.