The eval maturity model describes how teams make evaluation more systematic, automated, and connected to production workflows over time.
Most teams do not start with a fully automated evaluation system. They usually begin with manual reviews, ad hoc datasets, or UI-based workflows, then gradually move toward agent-assisted debugging, developer workflows, and monitor-triggered automation.
The model has four stages:
Crawl
GUI-first evaluation, where domain experts and product teams can participate through the platform UI.
Walk
AI-assisted workflows, where AI engineering agents help analyze traces, draft evaluators, generate datasets, and run evals.
Run
Headless developer workflows through the CLI, APIs, and coding agents, with evals moving into the development loop.
Fly
Monitor-triggered agents, where production degradation can trigger triage, debugging, and bounded improvement workflows.
The same evaluation harness architecture applies at every stage. What changes is the level of automation, which depends on the team’s maturity.
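As a rough illustration of one harness serving every stage, the Python sketch below defines a single evaluation function that could be called from a UI action, an agent, a CLI command, or a production monitor. Every name in it (Example, exact_match, run_eval, should_trigger_triage, and the max_drop parameter) is hypothetical and chosen only for this example; none of it is the platform's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 if the outputs match after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: Sequence[Example],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Shared harness: run the task on every example and return the mean score.

    The caller differs by stage (a person in the UI, an agent, a CLI or CI
    job, or a monitor), but the harness itself stays the same.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = [scorer(task(ex.input), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


def should_trigger_triage(
    production_score: float,
    baseline_score: float,
    max_drop: float = 0.05,
) -> bool:
    """Assumed monitor rule for the Fly stage: trigger when the production
    score falls more than max_drop below the baseline."""
    return (baseline_score - production_score) > max_drop


if __name__ == "__main__":
    dataset = [Example("a", "A"), Example("b", "B"), Example("c", "x")]
    score = run_eval(str.upper, dataset)
    print(f"mean score: {score:.2f}")
    print("trigger triage:", should_trigger_triage(score, baseline_score=0.9))
```

In this sketch, moving from Crawl to Fly changes only who or what calls run_eval and how the result is acted on; the dataset, task, and scorer definitions are reused as-is.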