What is a dataset
In Arize, a dataset is the fixed set of examples you rerun in experiments to compare changes to your app over time. It gives you a stable benchmark, so you can tell whether a prompt, model, or pipeline update actually improved results or introduced regressions.
What to include
A useful dataset blends typical examples that represent everyday traffic, edge cases the app has struggled with (ambiguous inputs, long contexts, unusual formats), and known failures pulled from traces, evaluator results, or reviewer feedback. Without typical examples, you optimize for edge cases and regress on the common path; without failures, you can’t prove a fix actually holds.

Dataset types
You’ll also see datasets described by their source or purpose. These labels overlap and shift as a dataset matures:

- Regression: Examples where the app has already failed. Use these to verify a fix holds and doesn’t quietly reintroduce the bug.
- Golden: Inputs with hand-labeled expected outputs — a stable benchmark for comparing prompt and model changes.
- Synthetic: Generated examples that mimic real inputs. Useful when production data is thin, sensitive, or missing the edge cases you want to stress-test.
Dataset row schema
Each row can include input messages, expected outputs, metadata, or any other columns your task function needs. Trace-sourced rows follow the OpenInference convention (e.g., `attributes.input.value`). CSVs and inline examples use your own column names. Keep them consistent across sources.
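For example, the same question-answer pair carries different column names depending on where the row came from. A quick sketch in plain Python (the `csv_row` names here are illustrative choices, not required names):

```python
# Trace-sourced row: flattened OpenInference attribute names.
trace_row = {
    "attributes.input.value": "What is Paul Graham known for?",
    "attributes.output.value": "Paul Graham is known for co-founding Y Combinator...",
}

# CSV or inline row: whatever headers your file uses.
csv_row = {
    "input": "What is Paul Graham known for?",
    "expected_output": "Paul Graham is known for co-founding Y Combinator...",
}
```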
Common dataset row shapes
The labels above describe why a row belongs in the dataset. The row itself should match what your task function reads. Common patterns include:

Key-value rows. Use this when the task needs multiple fields such as an input, retrieved context, and an expected output.

| Input | Context | Output |
|---|---|---|
| What is Paul Graham known for? | Paul Graham is an investor, entrepreneur, and computer scientist known for... | Paul Graham is known for co-founding Y Combinator... |
Input/output pairs. Use this when the task reads a single input and you grade against one expected output.

| Input | Output |
|---|---|
| "do you have to have two license plates in ontario" | "True" |
Creating a dataset
Pick the tool you work in. Each tab covers the trace-based, file-upload, and synthetic paths where they apply; a Python sketch for the code path follows the examples below.

- By Arize Skills
- By Alyx
- By UI
- By Code
The Arize skills plugin wires dataset and trace workflows into your coding agent through the `ax` CLI.

From traces. Combine arize-trace with arize-dataset. Try:

- “Export error spans from the last 7 days in my `production-chatbot` project and create a dataset called `error-regression-v1`.”
- “Find spans where `annotation.hallucination.label = 'yes'` over the past 14 days and save them as `hallucination-examples`.”
From files. Point the arize-dataset skill at a CSV, JSON, JSONL, or Parquet file you already have. Try:

- “Create a dataset called `billing-qa-v1` from `./data/billing_qa.csv` in my `support` space.”
- “Append the rows in `new_edge_cases.jsonl` to my existing `edge-cases` dataset.”
Synthetic. Ask the skill to generate examples when real data is thin or missing the cases you want to stress-test. Try:

- “Generate 50 synthetic billing support tickets with `query` and `expected_category` fields, then save as `support-synthetic-v1`.”
- “Draft 20 adversarial inputs targeting prompt injection for my chat agent and save as `adversarial-v1`.”
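If you work through the By Code tab instead, the same upload path runs through the Python SDK. A minimal sketch, assuming the arize SDK’s experimental datasets client; constructor arguments and the dataset-type constant vary across SDK versions, so verify the specifics against the SDK reference:

```python
import pandas as pd

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Rows to upload. Column names are your choice; they just need to match
# what your task function reads at experiment time.
df = pd.DataFrame(
    [{"input": "do you have to have two license plates in ontario", "output": "True"}]
)

# Assumption: API-key auth. Some SDK versions take developer_key instead.
client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

# create_dataset returns the new dataset's ID.
dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",  # copy from the Arize UI
    dataset_name="billing-qa-v1",
    dataset_type=GENERATIVE,
    data=df,
)
print(f"created dataset {dataset_id}")
```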

Managing your dataset
Add, edit, export, or delete rows as the app evolves. Datasets are versioned, and appends land in the latest version in place. A code sketch follows the skill examples below.

- By Arize Skills
- By Alyx
- By UI
- By Code
Use the arize-dataset skill to append, export, or inspect datasets without leaving your editor. Try asking your agent:

- “Append the rows in `new_examples.csv` to my `support-regression` dataset.”
- “Export the latest version of my `support-tickets` dataset so I can review it offline.”
- “Show me the schema and the first five rows of my `support-qa-v1` dataset.”
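The same operations are available from code. A sketch for export and inspection, again assuming the experimental datasets client; `get_dataset`’s exact parameters (dataset_name vs. dataset_id) differ by SDK version, and appending has its own call in the SDK reference:

```python
from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(api_key="YOUR_API_KEY")  # auth args vary by version

# Export: pull the latest version as a DataFrame for offline review.
# Assumption: lookup by dataset_name; some versions key on dataset_id.
df = client.get_dataset(space_id="YOUR_SPACE_ID", dataset_name="support-tickets")
df.to_csv("support_tickets_export.csv", index=False)

# Inspect: schema and the first five rows.
print(df.dtypes)
print(df.head())
```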

Auto add to dataset
Once the dataset exists, set up rules that automatically add spans when they match your criteria. Auto-add rules keep the dataset current with what’s actually happening in production, without manual curation.

From evaluator labels
After you’ve set up an evaluator on a project, add a post-processing step that routes spans to a dataset based on the evaluator’s result. See Create evaluators for evaluator setup, then edit the evaluator configuration for your task.

From filter criteria
You can also auto-add spans that match basic filter criteria without an evaluator, such as high token counts, latency above a threshold, or a specific tool call. Use this when the signal is structural rather than labeled.

Next step

Your dataset is in place. Now measure whether prompt, model, or pipeline changes actually improve your AI.

Set up an experiment
Define your baseline, decide what to change, and choose Playground or code.
Further reading
- View and manage traces: find spans worth turning into regression cases.
- Human review: turn reviewer feedback into dataset rows.
- Labeling queues: collect labels at scale before you build or update a golden dataset.