Load a dataset into the playground

Many users curate datasets for evaluating their prompts in the playground. These datasets often cover the following use cases:

  • 'Golden datasets' of core examples where it is important to avoid a regression — for example, critical user queries or high-impact business scenarios.

  • 'Challenge datasets' of hard examples where the goal is to hill climb on performance — for example, a dataset of jailbreak prompts or examples of past hallucinations.

When modifying a prompt in the playground, you can test the new prompt across a dataset of examples to validate that performance improves on challenging examples without regressing on core business use cases.
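The playground handles this for you, but the underlying idea is simple to express in code. The sketch below is a minimal, hypothetical illustration of running a candidate prompt template over every example in a dataset and collecting outputs for comparison; the `call_model` callable and the example structure (`"input"` / `"expected"` keys) are assumptions for illustration, not part of the playground.

```python
from typing import Callable

def run_prompt_over_dataset(
    prompt_template: str,
    dataset: list[dict],                 # each example: {"input": {<template vars>}, "expected": <reference output>}
    call_model: Callable[[str], str],    # any function that sends a prompt to an LLM and returns text
) -> list[dict]:
    """Run a candidate prompt over every dataset example and collect outputs."""
    results = []
    for example in dataset:
        # Fill the template with the example's input variables.
        prompt = prompt_template.format(**example["input"])
        output = call_model(prompt)
        results.append({
            "input": example["input"],
            "expected": example.get("expected"),
            "output": output,
        })
    return results
```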

  • Select 'Load a Dataset' to run the template across multiple examples.

  • View additional metadata associated with each example in the dataset without leaving the playground.

  • Click on a row to zoom in and scroll through a side-by-side comparison of the original dataset output and the new LLM output for that example.
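For an offline analogue of that side-by-side view, you can pair each example's original (reference) output with the new output and flag rows that changed, so regressions stand out. This sketch reuses the `results` structure from the example above; the exact-match check is a stand-in for whatever comparison (exact match, a grader, or human review) fits your use case.

```python
def compare_outputs(results: list[dict]) -> list[dict]:
    """Pair each example's reference output with the new output and flag changes."""
    comparison = []
    for row in results:
        comparison.append({
            "input": row["input"],
            "original": row["expected"],
            "new": row["output"],
            # Exact-match is only a placeholder comparison for this sketch.
            "changed": row["output"] != row["expected"],
        })
    return comparison
```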
