An evaluation dataset is a collection of examples used to test an AI system. Each example may include inputs, expected outputs, retrieved context, labels, metadata, traces, or scoring criteria.
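As a rough illustration, one example could be modeled as a small record type. This is a minimal sketch, not a standard schema: the class name `EvalExample`, all field names, and the sample values are assumptions chosen to mirror the components listed above.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Optional
import json


@dataclass
class EvalExample:
    """One evaluation example (hypothetical schema); only `input` is required."""
    input: str
    expected_output: Optional[str] = None
    retrieved_context: list[str] = field(default_factory=list)
    labels: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    trace: list[dict[str, Any]] = field(default_factory=list)
    scoring_criteria: Optional[str] = None


# Illustrative values only; not taken from any real dataset.
example = EvalExample(
    input="What is the refund window for annual plans?",
    expected_output="30 days from the purchase date.",
    retrieved_context=["Annual plans can be refunded within 30 days of purchase."],
    labels=["billing", "policy-sensitive"],
    metadata={"source": "hand-written"},
    scoring_criteria="Answer must state the 30-day window and not invent exceptions.",
)

# Serialize to a single JSON line so examples can be stored one per line.
line = json.dumps(asdict(example))
```

Making every field except the input optional reflects the "may include" wording: a classification example might carry only a label, while a retrieval example might carry context and scoring criteria but no single expected output.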
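Evaluation datasets should evolve with the production system: as real traffic exposes new failures and usage patterns, those cases should be added as examples. The most useful datasets include real failures, edge cases, high-value tasks, policy-sensitive examples, and examples that reflect representative user behavior.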
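One way to keep a dataset evolving is to append new cases as they surface and to track how many examples fall into each category. The sketch below assumes examples are plain dictionaries with hypothetical `input` and `category` keys; the category names and the helpers `add_example` and `coverage` are illustrative, not part of any particular tool.

```python
from collections import Counter

# Hypothetical category names mirroring the kinds of examples listed above.
CATEGORIES = {
    "real_failure",
    "edge_case",
    "high_value",
    "policy_sensitive",
    "representative",
}


def _normalize(text):
    """Lowercase and collapse whitespace so near-identical inputs compare equal."""
    return " ".join(text.lower().split())


def add_example(dataset, example):
    """Append `example` unless an equivalent input exists; return True if added."""
    if example["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {example['category']!r}")
    key = _normalize(example["input"])
    if any(_normalize(existing["input"]) == key for existing in dataset):
        return False  # Skip duplicates so one failure is not over-weighted.
    dataset.append(example)
    return True


def coverage(dataset):
    """Count examples per category to reveal under-represented kinds."""
    return Counter(example["category"] for example in dataset)


dataset = []
add_example(dataset, {"input": "Cancel my order", "category": "representative"})
add_example(dataset, {"input": "cancel  my ORDER", "category": "representative"})
add_example(dataset, {"input": "Refund a gift card twice", "category": "real_failure"})
counts = coverage(dataset)
```

Deduplicating on a normalized input is a deliberately simple choice here; a real pipeline might instead compare by semantic similarity or by a stable example identifier.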