Quickstart: Run First Experiment
This guide helps you run experiments to test and validate changes in your LLM applications against a curated dataset. For end-to-end walkthroughs, check out our cookbooks.
1. Upload a CSV as a dataset
Download this sample CSV and upload it into the UI.
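If you'd rather build your own CSV instead of the sample, any file with one row per test case works. As a rough illustration (the column names input and expected_output are hypothetical; match them to whatever your prompts and evaluators expect), it could look like:

input,expected_output
Telephone,Alexander Graham Bell
Light Bulb,Thomas Edison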
2. Test a prompt in playground
Load the dataset you created into prompt playground, and run it to see your results. Once you've finished the run, you can save it as an experiment to track your changes.
3. Run an evaluator on your playground experiments
Create a task to run evaluations on your experiment results. Arize will run the evaluator task in the background as soon as you create the task.
Compare experiments
With Diff Mode enabled, you can compare experiments side-by-side to easily spot improvements and regressions. Learn more
1. Install dependencies and get your API key
pip install arize openai pandas 'arize-phoenix[evals]'

Set up the client
Grab your Arize Space ID and API key from the Settings page.
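A minimal setup sketch, assuming you export those two values as environment variables named ARIZE_SPACE_ID and ARIZE_API_KEY (the variable names used in the snippets below):

import os

# Assumption: credentials are exported as environment variables with these names
ARIZE_SPACE_ID = os.environ["ARIZE_SPACE_ID"]
ARIZE_API_KEY = os.environ["ARIZE_API_KEY"]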
from arize.experimental.datasets import ArizeDatasetsClient
arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)

2. Create your dataset
Datasets are useful groupings of data points you want to use for test cases. You can create datasets through code, using LLMs to auto-generate them, or by importing spans using the Arize UI. Here's an example of creating them with code:
import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE
# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

# Create Arize dataset
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)
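create_dataset returns the new dataset's ID, which you'll pass to run_experiment in step 5. As a quick sanity check (a minimal sketch, assuming your version of the arize SDK exposes the client's get_dataset method), you can fetch the dataset back as a DataFrame:

# Sketch: verify the upload by fetching the dataset back from Arize
dataset_df = arize_client.get_dataset(space_id=ARIZE_SPACE_ID, dataset_id=dataset_id)
print(dataset_df.head())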
3. Define a task

Define the LLM task to be tested against your dataset here. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.
The task function receives the dataset_row as input, so you can access the variables in your dataset, and it returns a string.
Learn more about how to create a task for an experiment.
import openai
def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # ex: "Telephone"
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    return response.choices[0].message.content

4. Define your evaluators
An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility to write your own LLM judge using a custom template, or use code-based evaluators.
Read more about creating evaluators for experiments.
from arize.experimental.datasets.experiments.types import EvaluationResult
def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here",
    )
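The evaluator above is a simple code-based string check. As a rough sketch of the LLM-judge style mentioned earlier (the function name llm_judge and the grading prompt are illustrative, not part of the SDK), you can call an LLM with a custom template and map its verdict to an EvaluationResult:

# Sketch of an LLM judge; reuses the openai and EvaluationResult imports from the steps above
def llm_judge(output, dataset_row):
    # Hypothetical grading template; adapt the criteria to your task
    invention = dataset_row.get("attributes.input.value")
    prompt = (
        f"Question: Who invented {invention}?\n"
        f"Answer: {output}\n"
        "Is the answer factually correct? Reply with exactly 'correct' or 'incorrect'."
    )
    judge_client = openai.OpenAI()
    verdict = judge_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    ).choices[0].message.content.strip().lower()
    correct = verdict == "correct"
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation=f"LLM judge returned: {verdict}",
    )

Pass any additional evaluators alongside is_correct in the evaluators list when you run the experiment.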
5. Run the experiment

To run an experiment, specify the space you are logging to, the dataset_id, the task you would like to run, and the list of evaluators to run on the output. Running the experiment also logs traces to Arize so you can debug each run.
You can pass dry_run=True, which skips logging the results to Arize, and exit_on_error=True, which makes it easier to debug when an experiment doesn't run correctly.
See the run_experiment SDK definition for more info.
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

View the experiment result in Arize
Navigate to the dataset in the UI and see the experiment output table.