The data and datasets layer is the part of the evaluation system that manages examples used for testing, scoring, experimentation, and improvement. It includes production traces, curated test sets, golden datasets, human labels, synthetic examples, and metadata.
This layer is often where eval quality succeeds or fails. Even the best evaluators cannot compensate for test cases that do not represent the system's real failure modes.
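As a rough illustration of what this layer stores, the sketch below models a single example record and a simple check of how well a dataset covers known failure modes. It is only a sketch under stated assumptions: the names `Source`, `EvalExample`, `failure_mode_coverage`, and the `failure_mode` metadata key are hypothetical, and the sample records are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Source(Enum):
    """Where an example came from (hypothetical categories for illustration)."""
    PRODUCTION_TRACE = "production_trace"
    CURATED = "curated"
    GOLDEN = "golden"
    SYNTHETIC = "synthetic"


@dataclass
class EvalExample:
    """One record in the dataset layer (hypothetical schema)."""
    example_id: str
    input_text: str
    source: Source
    expected_output: Optional[str] = None   # reference answer, if one exists
    human_label: Optional[str] = None       # e.g. a reviewer's pass/fail judgment
    metadata: Dict[str, Any] = field(default_factory=dict)


def failure_mode_coverage(examples: List[EvalExample]) -> Counter:
    """Count examples per failure-mode tag; examples without a tag count as 'untagged'."""
    return Counter(ex.metadata.get("failure_mode", "untagged") for ex in examples)


# Made-up sample records, one per source type.
examples = [
    EvalExample("ex-1", "Cancel my order", Source.PRODUCTION_TRACE,
                human_label="fail", metadata={"failure_mode": "wrong_tool_call"}),
    EvalExample("ex-2", "What is your refund policy?", Source.GOLDEN,
                expected_output="Refunds are available within 30 days.",
                metadata={"failure_mode": "hallucinated_policy"}),
    EvalExample("ex-3", "Cancel order 123 and 456", Source.SYNTHETIC,
                metadata={"failure_mode": "wrong_tool_call"}),
    EvalExample("ex-4", "Hello", Source.CURATED),
]

coverage = failure_mode_coverage(examples)
print(coverage)
```

A count like this makes gaps visible: a failure mode seen in production but missing from the count, or a large share of untagged examples, suggests the dataset may not yet represent how the system actually fails.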