Create a dataset

There are four ways to load data into a dataset: from a CSV, from your spans, with code, or by generating a synthetic dataset.

Create a dataset from CSV

How to upload a dataset in Arize

You can upload a CSV as a dataset in Arize. Each column in the file becomes an attribute that you can access in experiments or in the prompt playground.

The primary requirement for the CSV is that it must have an id column. Here's an example CSV snippet, which has the question and response columns as attributes.

id,question,response
1,"here is a question","a satisfactory answer"
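Before uploading, it can help to confirm locally that the file has an id column. The following is a minimal sketch using only the Python standard library; the helper name `check_dataset_csv` is hypothetical and is not part of the Arize SDK.

```python
import csv
import io


def check_dataset_csv(csv_text: str) -> list[str]:
    """Return the attribute column names (every column except id).

    Raises ValueError if the header has no id column. This is a hypothetical
    local helper, not part of the Arize SDK.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if "id" not in header:
        raise ValueError(f"CSV needs an 'id' column, found: {header}")
    return [name for name in header if name != "id"]


# The sample CSV from this page.
sample = 'id,question,response\n1,"here is a question","a satisfactory answer"\n'
attributes = check_dataset_csv(sample)
print(attributes)  # ['question', 'response']
```

To check a file on disk, read its contents first and pass the text to the helper.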

Create a dataset from your spans

If you have added tracing to your application, you can create datasets from its spans in Arize. Go to the traces page and filter for the examples you care about, such as spans with a hallucination label.

You can use our tracing filters or ✨AI search to find the spans you want, then add them to your dataset using the buttons on the tracing page.

Create a dataset with code

If you'd like to create your datasets programmatically, you can use our Python SDK to create, update, and delete datasets.

To start, let's install the packages we need:

!pip install "arize[Datasets]" pandas

Let's get your developer key by clicking "code" on the datasets page.

Let's set up the Arize Dataset Client to create or update a dataset. See here for API reference.

from arize.experimental.datasets import ArizeDatasetsClient
import pandas as pd

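# Placeholder: paste the developer key you copied from the datasets page.
developer_key = "YOUR_DEVELOPER_KEY"
client = ArizeDatasetsClient(developer_key=developer_key)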

You can create many different kinds of datasets. The examples below are sorted by complexity.

If you want to upload a standard set of examples with string inputs, you can build the DataFrame as follows.

import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

data = [{
    "persona": "An aspiring musician who is writing their own songs",
    "problem": "I often get stuck overthinking my lyrics and melodies.",
}]

df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID", 
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df
)
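When you have more than one example, the records may not all share the same keys or value types. The sketch below normalizes a list of examples so every row has the same columns and string values before you build the DataFrame. The helper `normalize_examples`, the empty-string fill value, and the second example record are assumptions made for illustration; they do not come from the Arize SDK.

```python
def normalize_examples(examples: list[dict]) -> list[dict]:
    """Give every example the same keys and string values.

    Missing keys are filled with an empty string; that fill value is an
    assumption made for this sketch.
    """
    # Collect keys in first-seen order so the column order is stable.
    columns: list[str] = []
    for example in examples:
        for key in example:
            if key not in columns:
                columns.append(key)
    return [
        {key: str(example.get(key, "")) for key in columns}
        for example in examples
    ]


examples = [
    {
        "persona": "An aspiring musician who is writing their own songs",
        "problem": "I often get stuck overthinking my lyrics and melodies.",
    },
    {"persona": "A first-time manager"},  # hypothetical second example
]
rows = normalize_examples(examples)
print(rows[1])
```

The resulting `rows` list can be passed to `pd.DataFrame(rows)` and then to `client.create_dataset` in the same way as the single-example DataFrame above.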

Create a synthetic dataset

When you are first developing with LLMs, you typically start with a prompt and little else. The early iteration gets you to a point where the video demo looks amazing, but there's a lack of confidence in its reliability and robustness.

This is where you can use LLMs to generate examples for you based on your prompt. Here's an example prompt you can give to ChatGPT or your LLM of choice to create a set of examples that you can upload to Arize.

You are a data analyst. You are using LLMs to summarize a document. Create a CSV of 20 test cases with the following columns:

1. Input: The full document text, usually five paragraphs of articles about beauty products.
2. Prompt Variables: A JSON string of metadata attached to the article, such as the article title, date, and website URL
3. Output: The one line summary

This will generate a CSV file for you that looks like:

Coming soon, you'll be able to do this directly in the Arize platform based on your traces and prompts, but in the interim, you can upload this data with code or CSV.
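LLM-generated CSVs sometimes contain malformed cells, so a quick check before uploading can save a failed run later. The sketch below, using only the standard library, flags rows whose Prompt Variables cell does not parse as a JSON object. The helper name `find_bad_rows` and the sample rows are made up for illustration, and the column names assume the LLM followed the prompt above exactly.

```python
import csv
import io
import json


def find_bad_rows(csv_text: str, json_column: str = "Prompt Variables") -> list[int]:
    """Return 1-based data row numbers whose JSON column is not a JSON object."""
    bad_rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        try:
            # A missing or empty cell is treated as invalid JSON.
            parsed = json.loads(row.get(json_column) or "")
        except json.JSONDecodeError:
            bad_rows.append(row_number)
            continue
        if not isinstance(parsed, dict):
            bad_rows.append(row_number)
    return bad_rows


# Made-up rows for illustration only; real LLM output will differ.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["Input", "Prompt Variables", "Output"])
writer.writerow(["Document text A", json.dumps({"title": "Example A"}), "Summary A"])
writer.writerow(["Document text B", "{title: missing quotes}", "Summary B"])

bad = find_bad_rows(sample.getvalue())
print(bad)  # [2]
```

Rows reported here can be fixed by hand or regenerated before you upload the file with code or as a CSV.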
