Offline evaluation runs against a fixed dataset outside the production request path. It is best for pre-release testing, regression checks, prompt iteration, and reproducible comparisons. Online evaluation runs on production traffic or live traces. It is best for detecting real-world drift, monitoring quality, and finding failure modes that test sets missed.
Strong AI teams use both. Offline evals provide control and comparability. Online evals provide realism. The gap between the two, meaning cases that occur in production but are absent from the fixed dataset, is often where production failures hide.
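To make the contrast concrete, here is a minimal sketch that applies one scorer both to a fixed dataset and to sampled production traces, then compares the two. Every name in it (`score`, `offline_eval`, `online_eval`, `toy_model`, the trace fields) is hypothetical and chosen for illustration. It also assumes each logged trace carries an expected answer, for example from later human labeling, which real production traffic often lacks.

```python
import random
from statistics import mean
from typing import Callable, Optional

def score(output: str, expected: str) -> float:
    """Toy scorer (an assumption): 1.0 if the expected text appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def offline_eval(model: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Run the model on every example of a fixed dataset and average the scores."""
    return mean(score(model(prompt), expected) for prompt, expected in dataset)

def online_eval(traces: list[dict], sample_rate: float, rng: random.Random) -> Optional[float]:
    """Score a random sample of already-logged production traces.

    The model is not called again here; the logged output is what gets scored.
    Returns None when the sample happens to be empty.
    """
    sampled = [t for t in traces if rng.random() < sample_rate]
    if not sampled:
        return None
    return mean(score(t["output"], t["expected"]) for t in sampled)

# Hypothetical stand-in for a real model.
ANSWERS = {"capital of France?": "Paris is the capital.", "2+2?": "4"}

def toy_model(prompt: str) -> str:
    return ANSWERS.get(prompt, "I don't know")

# Fixed dataset used before release.
dataset = [("capital of France?", "Paris"), ("2+2?", "4")]

# Logged traces, including a question the fixed dataset never covered.
traces = [
    {"input": "capital of France?", "output": toy_model("capital of France?"), "expected": "Paris"},
    {"input": "capital of Peru?", "output": toy_model("capital of Peru?"), "expected": "Lima"},
]

offline_score = offline_eval(toy_model, dataset)
online_score = online_eval(traces, sample_rate=1.0, rng=random.Random(0))
gap = offline_score - online_score

print(f"offline={offline_score:.2f} online={online_score:.2f} gap={gap:.2f}")
```

In this toy setup the model passes every offline example but fails the production question about Peru, so the online score is lower. Reusing the same scorer on both sides is a deliberate choice: it keeps the two numbers comparable, so a gap points to differences in the inputs rather than in the metric.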