Create a dataset

There are four ways to load data into a dataset: upload a CSV, add spans from your projects, create it with code, or generate it synthetically.

Create a dataset from CSV

You can upload a CSV file as a dataset in Arize. The file's columns can then be accessed in experiments or in the prompt playground.
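
As a minimal sketch of preparing such a file, the snippet below writes a small two-column CSV with Python's standard csv module. The file name, the helper write_dataset_csv, and the column names "input" and "output" are hypothetical placeholders chosen for illustration, not names required by Arize.

```python
import csv
import os
import tempfile


def write_dataset_csv(path, rows, fieldnames):
    """Write rows (a list of dicts) to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical example rows; each key becomes a column header in the file.
rows = [
    {"input": "Telephone", "output": "Alexander Graham Bell"},
    {"input": "Light Bulb", "output": "Thomas Edison"},
]

# Written to the temp directory here; use any path you like.
csv_path = os.path.join(tempfile.gettempdir(), "inventions.csv")
write_dataset_csv(csv_path, rows, fieldnames=["input", "output"])
```

The resulting file can then be uploaded through the Arize UI.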


Create a dataset from your spans

Arize supports adding spans from your projects to datasets. Traces that contain errors or failed evals make useful test cases for ongoing development. You can use our tracing filters or ✨AI search to curate your dataset.


Create a dataset with code

If you'd like to create datasets programmatically, you can use our clients to create, update, and delete them.

To start, let's install the packages we need:

pip install "arize[Datasets]" pandas

You can get your API key by navigating to the "Settings" page.

Let's set up the Arize Datasets client to create or update a dataset. See here for the API reference.

from arize.experimental.datasets import ArizeDatasetsClient
client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)
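
The snippet above assumes ARIZE_API_KEY is already defined, and the dataset example further down likewise assumes ARIZE_SPACE_ID. One possible way to supply both without hard-coding them is to read them from environment variables. This is only a sketch: the helper load_arize_credentials is hypothetical, and the environment variable names are an assumption that simply mirrors the Python variable names.

```python
import os


def load_arize_credentials(env=os.environ):
    """Return (api_key, space_id) from the environment, or raise if either is missing."""
    required = ("ARIZE_API_KEY", "ARIZE_SPACE_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ARIZE_API_KEY"], env["ARIZE_SPACE_ID"]


# Usage, once both variables are exported in your shell:
# ARIZE_API_KEY, ARIZE_SPACE_ID = load_arize_credentials()
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the client.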

You can create many different kinds of datasets. The examples below are sorted by complexity.

This is a simple dataset with just string values for the columns.

import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})


dataset_id = client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_invention_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)
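
If your examples start out as a list of (input, output) pairs rather than ready-made columns, a small helper can reshape them into the column layout used above. This is a hedged sketch: to_dataset_columns is a hypothetical helper, not part of the Arize client, and it assumes every value can be represented as a string.

```python
def to_dataset_columns(pairs):
    """Turn (input, output) pairs into a column-oriented dict of string lists."""
    inputs, outputs = [], []
    for input_value, output_value in pairs:
        inputs.append(str(input_value))
        outputs.append(str(output_value))
    return {
        "attributes.input.value": inputs,
        "attributes.output.value": outputs,
    }


columns = to_dataset_columns([
    ("Telephone", "Alexander Graham Bell"),
    ("Light Bulb", "Thomas Edison"),
])
# Passing this dict to pd.DataFrame(columns) gives a frame shaped like
# inventions_dataset in the example above.
```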

Create a synthetic dataset

In some cases, the data you have might not be enough to cover all the scenarios you want to test. This is where you can use Alyx for Synthetic Dataset Generation:

  • Suggested Prompt: “Generate a synthetic dataset of 20 examples that cover...”

  • Use When: You need labeled examples to test, fine-tune, or evaluate prompts without relying on real user data.

  • Description: Creates artificial examples that mimic real-world scenarios, enabling faster experimentation.

You can save your generated examples as a dataset and test them directly in the playground.
