Run experiment
Run experiments to test changes to your models, prompts, and agents. Experiments can be run via the UI or via code.
Run experiment via UI
1. Test a prompt in playground
First, create a dataset. Load the dataset you created into the prompt playground and run it to see your results. Once the run has finished, you can save it as an experiment to track your changes.
2. Run an evaluator on your playground experiments
Use evaluators to automatically measure the quality of your experiment results. Once defined, Arize runs them in the background. Evaluators can be either LLM Judges or code-based assessments.
3. Compare experiment results
Each prompt iteration is stored separately, and Arize makes it easy to compare experiment results against each other with Diff Mode.
You can also use Alyx to get automated insights as you compare your experiments, with the ability to both summarize results and highlight key differences across runs.
Run experiment via Code
Check out the API reference for more details.
1. Define your dataset
You can create a new dataset or use an existing dataset.
import pandas as pd

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)
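If you already have a dataset, you can fetch it instead of creating a new one. A minimal sketch, assuming the client exposes a get_dataset method that accepts the space ID and dataset name; verify the exact parameters in the API reference:

# Sketch: retrieve an existing dataset instead of creating a new one.
# Assumes get_dataset(space_id=..., dataset_name=...); check the API reference.
existing_dataset = arize_client.get_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
)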
2. Define a task
A task is any function that you want to run on a dataset. The simplest version of a task looks like the following:
from typing import Dict

def task(dataset_row: Dict):
    return dataset_row
When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.
import openai

def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    return response.choices[0].message.content
Task inputs
The task function can take the following optional arguments for convenience, which automatically pass dataset_row attributes to your task function. The easiest way to access anything you need is the dataset_row argument itself.
dataset_row: the entire row of the data, with every column as a dictionary key (no single source attribute). Example: def task_fn(dataset_row): ...
input: the experiment run input, taken from attributes.input.value. Example: def task_fn(input): ...
expected: the expected output, taken from attributes.output.value. Example: def task_fn(expected): ...
metadata: metadata for the run, taken from attributes.metadata. Example: def task_fn(metadata): ...
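For example, a task can accept the mapped values directly instead of the full row. A minimal sketch that simply echoes the mapped input and expected output:

# Sketch: the experiment runner maps attributes.input.value to `input`
# and attributes.output.value to `expected` for you.
def task_with_convenience_args(input, expected) -> str:
    return f"Invention: {input}. Expected inventor: {expected}"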
3. Define an evaluator (Optional)
You can also optionally define an evaluator to assess your task outputs in experiments. Evaluators can be LLM Judges or Code Evaluators. For example, here's a simple code evaluator that checks whether the expected output appears in the LLM output:
from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here",
    )
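An evaluator can also be an LLM Judge. A minimal sketch using the same OpenAI client as the task above; the judge prompt, model choice, and verdict parsing are illustrative assumptions, not a prescribed Arize API:

import openai

from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct_llm_judge(output, dataset_row):
    # Sketch of an LLM-as-judge evaluator: ask a model to compare the task
    # output against the expected answer stored on the dataset row.
    expected = dataset_row.get("attributes.output.value")
    judge_client = openai.OpenAI()
    response = judge_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                f"Expected answer: {expected}\n"
                f"Model answer: {output}\n"
                "Reply with exactly one word: correct or incorrect."
            ),
        }],
        max_tokens=5,
    )
    verdict = response.choices[0].message.content.strip().lower()
    is_match = verdict.startswith("correct")
    return EvaluationResult(
        score=int(is_match),
        label="correct" if is_match else "incorrect",
        explanation=f"LLM judge verdict: {verdict}",
    )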
4. Run the experiment
Then, use the run_experiment function to run the task function against your dataset, run the evaluation function against the outputs, and log the results and traces to Arize.
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)
We offer several convenience arguments:
concurrency: runs dataset rows in parallel, reducing the time to complete the experiment.
dry_run=True: does not log the results to Arize.
exit_on_error=True: stops the run on the first error, making it easier to debug when an experiment doesn't run correctly.
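For example, while iterating you might combine these into a quick local debug run. A sketch; the experiment name and concurrency value are arbitrary:

# Sketch: a fast local check; nothing is logged to Arize while dry_run=True.
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],
    experiment_name="debug-run",  # arbitrary name for this sketch
    concurrency=2,
    exit_on_error=True,  # stop at the first failure while debugging
    dry_run=True,        # keep results local
)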
Once your experiment has finished running, you can see your experiment results in the Arize UI.