We support four ways of loading data into a dataset:

Create a dataset from CSV

You can upload a CSV file as a dataset in Arize. The columns in the file can then be accessed in experiments or in the prompt playground.
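
If your examples live in code rather than in a spreadsheet, you can write them out as a CSV before uploading. The following is a minimal sketch using only the Python standard library; the column names `input` and `expected_output` are illustrative assumptions, not names Arize requires.

```python
import csv
import io

# Hypothetical example rows; use whatever columns your experiments need.
rows = [
    {"input": "Telephone", "expected_output": "Alexander Graham Bell"},
    {"input": "Light Bulb", "expected_output": "Thomas Edison"},
]

# Write a header row followed by one line per example.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["input", "expected_output"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# To produce a file for upload, write csv_text to disk, for example:
# with open("inventions.csv", "w", newline="") as f:
#     f.write(csv_text)
print(csv_text)
```

Each header in the first line becomes a column of the uploaded dataset.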

Create a dataset from your spans

Arize supports adding spans from your projects to datasets. The trace data from an application with errors or faulty evals can become fuel for ongoing development. You can use our tracing filters or ✨AI Search to curate your dataset.
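
Conceptually, curating a dataset from spans is a filter over trace data: keep the spans that errored or failed an eval. The sketch below illustrates that idea in plain Python. The record fields (`span_id`, `status`, `eval_label`) are hypothetical and are not Arize's span schema; in the product you would apply the equivalent filter in the UI.

```python
# Hypothetical span records; field names are illustrative only.
spans = [
    {"span_id": "a1", "status": "OK", "eval_label": "correct", "input": "Telephone"},
    {"span_id": "b2", "status": "ERROR", "eval_label": None, "input": "Light Bulb"},
    {"span_id": "c3", "status": "OK", "eval_label": "incorrect", "input": "Radio"},
]


def needs_review(span):
    """Return True for spans that errored or received a failing eval label."""
    return span["status"] == "ERROR" or span["eval_label"] == "incorrect"


# Keep only the spans worth turning into dataset examples.
curated = [span for span in spans if needs_review(span)]
curated_ids = [span["span_id"] for span in curated]
print(curated_ids)
```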

Create a dataset with code

If you’d like to create your datasets programmatically, you can use our clients to create, update, and delete datasets. To start, let’s install the packages we need:

pip install --pre arize pandas

You can get your API key by navigating to the “Settings” page.

Let’s set up the Arize Dataset Client to create or update a dataset. See the API reference for details.

from arize import ArizeClient

client = ArizeClient(api_key="your-arize-api-key")
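
Rather than hardcoding the key in source, you may prefer to read it from the environment. This is a minimal sketch; the variable name `ARIZE_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not something the client is documented here to read on its own.

```python
import os


def load_api_key(env_var="ARIZE_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key


# Usage with the client shown above:
# client = ArizeClient(api_key=load_api_key())
```

Failing early with a clear message avoids a less obvious authentication error on the first request.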

You can create many different kinds of datasets. The examples below are sorted by complexity.

Simple dataset

This is a simple dataset with just string values for the columns.

import pandas as pd

# Example dataset: each row pairs an invention (input) with its inventor (expected output)
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

dataset = client.datasets.create(
    space_id="your-arize-space-id",
    name="test_invention_dataset",
    examples=inventions_dataset,
)
dataset_id = dataset.id
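
Before uploading, it can help to check the DataFrame locally for missing columns or empty cells. The helper below is a hypothetical sketch (`check_examples` is not part of the Arize client); it uses only pandas.

```python
import pandas as pd


def check_examples(df, required_columns):
    """Return a list of problems found in the examples; an empty list means none."""
    problems = []
    for column in required_columns:
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif df[column].isna().any():
            problems.append(f"empty values in column: {column}")
    if df.empty:
        problems.append("no rows")
    return problems


inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

required = ["attributes.input.value", "attributes.output.value"]
problems = check_examples(inventions_dataset, required)
print(problems)
```

An empty list means the examples passed these local checks; it says nothing about validation the service itself may perform.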

Dataset with prompt template & variables

Datasets in Arize support flexible columns, so you can also add a prompt template and variables to each row. In this example, we set attributes.llm.prompt_template.template and attributes.llm.prompt_template.variables, following the OpenInference semantic conventions. Arize automatically imports the variables as input variables.

import pandas as pd
import json

PROMPT_TEMPLATE = """
You are an expert in the history of technological inventions.
Identify the individual or organization that created the following invention.

Invention: {invention}
"""

data = [
    {
        "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
        "attributes.llm.prompt_template.variables": json.dumps({
            "invention": "Telephone",
        }),
        "attributes.output.value": "Alexander Graham Bell"
    }
]

df = pd.DataFrame(data)

dataset = client.datasets.create(
    space_id="your-arize-space-id",
    name="prompt_invention_dataset",
    examples=df,
)
dataset_id = dataset.id
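
To see how the template and variables columns relate, you can render a row locally. This sketch uses Python's `str.format` purely for illustration; how Arize performs the substitution is not stated here, and the `make_row` helper is hypothetical.

```python
import json

PROMPT_TEMPLATE = """
You are an expert in the history of technological inventions.
Identify the individual or organization that created the following invention.

Invention: {invention}
"""


def make_row(invention, inventor):
    """Build one dataset row with the template, JSON-encoded variables, and expected output."""
    return {
        "attributes.llm.prompt_template.template": PROMPT_TEMPLATE,
        "attributes.llm.prompt_template.variables": json.dumps({"invention": invention}),
        "attributes.output.value": inventor,
    }


row = make_row("Telephone", "Alexander Graham Bell")

# Decode the variables and substitute them into the template.
variables = json.loads(row["attributes.llm.prompt_template.variables"])
rendered = row["attributes.llm.prompt_template.template"].format(**variables)
print(rendered)
```

The same helper can be called once per example to build a list of rows for `pd.DataFrame`.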

Create a synthetic dataset

In some cases, the data you have might not be enough to cover all the scenarios you want to test. This is where you can use Alyx for synthetic dataset generation:

Suggested Prompt: “Generate a synthetic dataset of 20 examples that cover…”
Use When: You need labeled examples to test, fine-tune, or evaluate prompts without relying on real user data.
Description: Creates artificial examples that mimic real-world scenarios, enabling faster experimentation.

You can save your generated examples as a dataset and test them directly in the playground.
