A golden dataset is one that contains trusted inputs and ideal outputs. These are typically hand-labeled by humans (often with domain expertise) and serve as a benchmark for model output quality.
In AI development, building an evaluation from the ground up requires iteration and testing, and golden datasets are often instrumental to that process. This tutorial walks through how to create a benchmark dataset with annotations and then develop a custom LLM evaluator. We refine the evaluator against the golden dataset until it meets quality standards, highlighting practical techniques for improving evaluator accuracy over time.
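To make the idea concrete, a golden dataset can be as simple as a small collection of hand-labeled records. The sketch below is a minimal illustration, not a required schema: the field names (`input`, `output`, `label`) and the `evaluate` callable are hypothetical stand-ins for whatever evaluator you end up building.

```python
# A minimal sketch of a hand-labeled golden dataset.
# Field names are illustrative, not a required schema.
golden_dataset = [
    {
        "input": "Summarize: The meeting was moved to Friday at 3pm.",
        "output": "The meeting is now on Friday at 3pm.",
        "label": "good",  # human-assigned ground-truth judgment
    },
    {
        "input": "Summarize: Revenue grew 12% year over year.",
        "output": "Revenue fell sharply last quarter.",
        "label": "bad",
    },
]

def evaluator_accuracy(evaluate, dataset):
    """Compare an LLM evaluator's judgments against the golden labels.

    `evaluate` is any callable taking (input, output) and returning a
    label such as "good" or "bad".
    """
    correct = sum(
        evaluate(ex["input"], ex["output"]) == ex["label"] for ex in dataset
    )
    return correct / len(dataset)
```

Measuring agreement against the golden labels in this way is what lets you iterate on the evaluator's prompt or logic and verify that each change actually improves accuracy.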
📓 Notebook
📊 Learn more about datasets