What is a dataset
In Arize, a dataset is the fixed set of examples you rerun in experiments to compare changes to your app over time. It gives you a stable benchmark, so you can tell whether a prompt, model, or pipeline update actually improved results or introduced regressions.
What to include
A useful dataset blends typical examples that represent everyday traffic, edge cases the app has struggled with (ambiguous inputs, long contexts, unusual formats), and known failures pulled from traces, evaluator results, or reviewer feedback. Without typical examples, you optimize for edge cases and regress on the common path; without failures, you can’t prove a fix actually holds.

Dataset types
You’ll also see datasets described by their source or purpose. These labels overlap and shift as a dataset matures:

- Regression: Examples where the app has already failed. Use these to verify a fix holds and doesn’t quietly reintroduce the bug.
- Golden: Inputs with hand-labeled expected outputs — a stable benchmark for comparing prompt and model changes.
- Synthetic: Generated examples that mimic real inputs. Useful when production data is thin, sensitive, or missing the edge cases you want to stress-test.
Dataset row schema
Each row can include input messages, expected outputs, metadata, or any other columns your task function needs. Trace-sourced rows follow the OpenInference convention (e.g., `attributes.input.value`). CSVs and inline examples use your own column names. Keep them consistent across sources.
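For example, the same question-answer pair carries different column names depending on where the row came from. A quick sketch in plain Python (the `csv_row` names here are illustrative choices, not required names):

```python
# Trace-sourced row: flattened OpenInference attribute names.
trace_row = {
    "attributes.input.value": "What is Paul Graham known for?",
    "attributes.output.value": "Paul Graham is known for co-founding Y Combinator...",
}

# CSV or inline row: whatever headers your file uses.
csv_row = {
    "input": "What is Paul Graham known for?",
    "expected_output": "Paul Graham is known for co-founding Y Combinator...",
}
```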
Common dataset row shapes
The labels above describe why a row belongs in the dataset. The row itself should match what your task function reads. Common patterns include:

Key-value rows. Use this when the task needs multiple fields such as an input, retrieved context, and an expected output.

| Input | Context | Output |
|---|---|---|
| What is Paul Graham known for? | Paul Graham is an investor, entrepreneur, and computer scientist known for... | Paul Graham is known for co-founding Y Combinator... |
Input/output pairs. Use this when the task reads a single input and you grade against one expected output.

| Input | Output |
|---|---|
| "do you have to have two license plates in ontario" | "True" |
Creating a dataset
Pick the tool you work in. Each tab covers the trace-based, file-upload, and synthetic paths where they apply; a Python sketch for the code path follows the examples below.

- By Arize Skills
- By Alyx
- By UI
- By Code
The Arize skills plugin wires dataset and trace workflows into your coding agent through the `ax` CLI.

From traces. Combine arize-trace with arize-dataset. Try:

- “Export error spans from the last 7 days in my `production-chatbot` project and create a dataset called `error-regression-v1`.”
- “Find spans where `annotation.hallucination.label = 'yes'` over the past 14 days and save them as `hallucination-examples`.”
From files. Point the arize-dataset skill at a CSV, JSON, JSONL, or Parquet file you already have. Try:

- “Create a dataset called `billing-qa-v1` from `./data/billing_qa.csv` in my `support` space.”
- “Append the rows in `new_edge_cases.jsonl` to my existing `edge-cases` dataset.”
Synthetic. Ask the skill to generate examples when real data is thin or missing the cases you want to stress-test. Try:

- “Generate 50 synthetic billing support tickets with `query` and `expected_category` fields, then save as `support-synthetic-v1`.”
- “Draft 20 adversarial inputs targeting prompt injection for my chat agent and save as `adversarial-v1`.”
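If you work through the By Code tab instead, the same upload path runs through the Python SDK. A minimal sketch, assuming the arize SDK’s experimental datasets client; constructor arguments and the dataset-type constant vary across SDK versions, so verify the specifics against the SDK reference:

```python
import pandas as pd

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Rows to upload. Column names are your choice; they just need to match
# what your task function reads at experiment time.
df = pd.DataFrame(
    [{"input": "do you have to have two license plates in ontario", "output": "True"}]
)

# Assumption: API-key auth. Some SDK versions take developer_key instead.
client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

# create_dataset returns the new dataset's ID.
dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",  # copy from the Arize UI
    dataset_name="billing-qa-v1",
    dataset_type=GENERATIVE,
    data=df,
)
print(f"created dataset {dataset_id}")
```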

Managing your dataset
Add, edit, export, or delete rows as the app evolves. Datasets are versioned, and appends land in the latest version in place. A code sketch follows the skill examples below.

- By Arize Skills
- By Alyx
- By UI
- By Code
Use the arize-dataset skill to append, export, or inspect datasets without leaving your editor. Try asking your agent:

- “Append the rows in `new_examples.csv` to my `support-regression` dataset.”
- “Export the latest version of my `support-tickets` dataset so I can review it offline.”
- “Show me the schema and the first five rows of my `support-qa-v1` dataset.”
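The same operations are available from code. A sketch for export and inspection, again assuming the experimental datasets client; `get_dataset`’s exact parameters (dataset_name vs. dataset_id) differ by SDK version, and appending has its own call in the SDK reference:

```python
from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(api_key="YOUR_API_KEY")  # auth args vary by version

# Export: pull the latest version as a DataFrame for offline review.
# Assumption: lookup by dataset_name; some versions key on dataset_id.
df = client.get_dataset(space_id="YOUR_SPACE_ID", dataset_name="support-tickets")
df.to_csv("support_tickets_export.csv", index=False)

# Inspect: schema and the first five rows.
print(df.dtypes)
print(df.head())
```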

Auto add to dataset
Once the dataset exists, set up rules that automatically add spans when they match your criteria. Auto-add rules keep the dataset current with what’s actually happening in production, without manual curation.

From evaluator labels
After you’ve set up an evaluator on a project, add a post-processing step that routes spans to a dataset based on the evaluator’s result. See Create evaluators for evaluator setup, then edit the evaluator configuration for your task.

From filter criteria
You can also auto-add spans that match basic filter criteria without an evaluator, such as high token counts, latency above a threshold, or a specific tool call. Use this when the signal is structural rather than labeled.

Next step

Your dataset is in place. Now measure whether prompt, model, or pipeline changes actually improve your AI.

Set up an experiment
Define your baseline, decide what to change, and choose Playground or code.
Further reading
- View and manage traces: find spans worth turning into regression cases.
- Human review: turn reviewer feedback into dataset rows.
- Labeling queues: collect labels at scale before you build or update a golden dataset.