Datasets

Version-controlled examples for running your experiments

Datasets are the foundation for evaluating and improving your LLM application. They enable consistent and repeatable assessments by using structured collections of example data.


Key features

  1. Create integrated versioned datasets to store test cases and track version history

  2. Evaluate your experiments and track application performance over time

  3. Update datasets to modify or add to existing dataset versions

  4. Export datasets to manipulate in code or download
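As a rough illustration of the create, update, and export features listed above, the sketch below models a versioned dataset in plain Python. The `VersionedDataset` class and its methods are hypothetical names made up for this example; they are not the Arize SDK, which may work differently.

```python
import json
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Hypothetical in-memory dataset that keeps every version (not the Arize API)."""

    name: str
    versions: list = field(default_factory=list)  # each version is a list of example dicts

    def create_version(self, examples):
        """Store a copy of `examples` as a new version and return its 1-based number."""
        self.versions.append([dict(example) for example in examples])
        return len(self.versions)

    def add_examples(self, new_examples):
        """Create a new version holding the latest examples plus `new_examples`."""
        latest = self.versions[-1] if self.versions else []
        return self.create_version(latest + list(new_examples))

    def export_json(self, version=None):
        """Serialize one version (the latest by default) to a JSON string."""
        index = (version or len(self.versions)) - 1
        return json.dumps(self.versions[index], indent=2)


dataset = VersionedDataset(name="qa-test-cases")
v1 = dataset.create_version(
    [{"input": "Paris is the capital of France", "output": "True"}]
)
v2 = dataset.add_examples(
    [{"input": "Canada borders the United States", "output": "True"}]
)
```

Because every update appends a new version instead of overwriting the old one, an experiment that ran against version 1 can still be reproduced after version 2 exists.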

How to Create Datasets

You can build datasets from a variety of sources:

1. Manually Curated Examples: The best place to start. Based on your understanding of the application, you can define a handful of examples by hand; around 20 well-crafted ones often go a long way. These examples should cover expected behavior and common edge cases.

2. Historical Logs: Once your application is live, you’ll begin collecting valuable usage data. Logs can reveal examples where the app struggled (e.g., user dissatisfaction, high latency). Add these examples to datasets to continually test against real-world issues.

3. Synthetic Data: With a few solid examples in hand, you can use LLMs to generate many similar examples. Synthetic data is useful for scaling evaluations quickly, but it should be guided by well-designed source examples.
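The first two sources can be combined in a few lines of plain Python. This is a minimal sketch: the log fields (`latency_ms`, `user_feedback`) and the 5000 ms cutoff are assumptions chosen for illustration, so adapt them to whatever your application actually records.

```python
# Hand-written examples covering expected behavior.
curated_examples = [
    {"input": "Paris is the capital of France", "output": "True"},
    {"input": "The native language of Japan is English", "output": "False"},
]

LATENCY_THRESHOLD_MS = 5000  # assumed cutoff for "high latency"


def select_problem_logs(logs, latency_threshold_ms=LATENCY_THRESHOLD_MS):
    """Return logged interactions that show user dissatisfaction or high latency."""
    return [
        {"input": log["input"], "output": log["output"]}
        for log in logs
        if log["user_feedback"] == "thumbs_down"
        or log["latency_ms"] > latency_threshold_ms
    ]


# Hypothetical production logs; the field names are assumptions.
logs = [
    {"input": "What's the boiling point of water on Mars?", "output": "I don't know",
     "latency_ms": 800, "user_feedback": "thumbs_down"},
    {"input": "Translate 'cat' to Spanish", "output": "gato",
     "latency_ms": 7200, "user_feedback": "thumbs_up"},
    {"input": "Is Paris the capital of France?", "output": "Yes",
     "latency_ms": 450, "user_feedback": "thumbs_up"},
]

problem_examples = select_problem_logs(logs)
dataset_examples = curated_examples + problem_examples
```

Synthetic generation is left out here because it depends on which LLM client you use; the resulting examples would be appended to `dataset_examples` in the same way.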

Dataset Structures

Arize supports flexible formats depending on your LLM application's needs:

1. Key-Value Pairs: Great for multi-input/multi-output tasks like function calls, agents, or classification tasks.

Input:
"What is Paul Graham known for?"
Context:
"Paul Graham is an investor, entrepreneur, and computer scientist known for..."
Output:
"Paul Graham is known for co-founding Y Combinator..."

2. Prompt-Completion (String Pairs): Ideal for testing single-turn completions.

Input:
"do you have to have two license plates in ontario"
Output:
"true"

3. Messages or Chat Format: Best suited for conversational agents.

Input:
{"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]}
Output:
{"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]}
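In code, each of the three structures can be expressed as an ordinary Python dictionary. The sketch below mirrors the examples above; the lowercase field names (`input`, `context`, `output`) and the `is_chat_example` helper are assumptions for illustration, not a schema required by Arize.

```python
# 1. Key-value pairs: several named fields per example.
key_value_example = {
    "input": "What is Paul Graham known for?",
    "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for...",
    "output": "Paul Graham is known for co-founding Y Combinator...",
}

# 2. Prompt-completion: one input string and one output string.
string_pair_example = {
    "input": "do you have to have two license plates in ontario",
    "output": "true",
}

# 3. Chat format: input and output each hold a list of role/content messages.
chat_example = {
    "input": {"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]},
    "output": {"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]},
}


def is_chat_example(example):
    """Return True if both input and output hold a list of role/content messages."""
    for side in ("input", "output"):
        value = example.get(side)
        if not isinstance(value, dict) or not isinstance(value.get("messages"), list):
            return False
        for message in value["messages"]:
            if not isinstance(message, dict) or not message.keys() >= {"role", "content"}:
                return False
    return True
```

A check like `is_chat_example` can be handy before upload to catch rows that mix formats within one dataset.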

Types of Datasets

Golden Datasets: A golden dataset contains trusted inputs and ideal outputs. These are typically hand-labeled and serve as a benchmark for model quality.

Input:
Paris is the capital of France
Output:
True

Input:
Canada borders the United States
Output:
True

Input:
The native language of Japan is English
Output:
False

Golden datasets are useful for regression testing and validating performance before releases.
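A pre-release check against a golden dataset can be as simple as an exact-match accuracy score. In this sketch, `always_true` is a stand-in for your real application call and the 0.9 release threshold is an assumed value, not a recommendation.

```python
golden_dataset = [
    {"input": "Paris is the capital of France", "output": "True"},
    {"input": "Canada borders the United States", "output": "True"},
    {"input": "The native language of Japan is English", "output": "False"},
]


def accuracy(predict, dataset):
    """Fraction of examples where predict(input) exactly matches the golden output."""
    correct = sum(1 for example in dataset if predict(example["input"]) == example["output"])
    return correct / len(dataset)


def always_true(_statement):
    """Stand-in for the real application call; it always answers "True"."""
    return "True"


RELEASE_THRESHOLD = 0.9  # assumed minimum score for shipping

score = accuracy(always_true, golden_dataset)
ready_for_release = score >= RELEASE_THRESHOLD
```

Exact string matching suits short labels like these; free-form outputs usually need a softer comparison, such as an LLM-based or similarity-based evaluator.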

Regression Datasets: A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes persist over time and that later changes don’t reintroduce old bugs. Examples are often pulled from user feedback or from logs showing problematic behavior.

Input:
What's the boiling point of water on Mars?
Output:
I don't know

Input:
Translate 'cat' to Spanish
Output:
Translation not available

Input:
Summarize: 'The U.S. economy grew 3% last quarter'
Output:
No summary found
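One way to use a regression dataset is to re-run each input and flag any case where the application still returns the recorded failing output. This sketch assumes the Output values above are the bad responses that were originally observed; `patched_app` is a hypothetical stand-in for your application after a fix.

```python
regression_dataset = [
    {"input": "What's the boiling point of water on Mars?", "output": "I don't know"},
    {"input": "Translate 'cat' to Spanish", "output": "Translation not available"},
    {"input": "Summarize: 'The U.S. economy grew 3% last quarter'", "output": "No summary found"},
]


def find_regressions(app, dataset):
    """Return the inputs for which the app still produces the recorded failing output."""
    return [
        example["input"]
        for example in dataset
        if app(example["input"]).strip() == example["output"]
    ]


def patched_app(prompt):
    """Hypothetical stand-in: translation is fixed, the other two cases still fail."""
    if prompt.startswith("Translate"):
        return "gato"
    if prompt.startswith("Summarize"):
        return "No summary found"
    return "I don't know"


still_failing = find_regressions(patched_app, regression_dataset)
```

An empty `still_failing` list only shows that the old failure strings are gone; it does not prove the new answers are correct, so pairing this check with a golden dataset gives better coverage.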
