Track and evaluate changes to prompts, models, and retrieval strategies. Run experiments with automatic tracing and evaluation.

Key Capabilities

  • Automatic tracing of all LLM calls during experiments
  • Concurrent execution for faster evaluation
  • Dry-run mode for testing without logging
  • Built-in evaluator support
  • Compare experiments side-by-side in the UI

Run an Experiment

Execute a task function across your dataset examples with automatic evaluation, then log the results to Arize. High-level flow:
  1. Resolve the dataset and download examples (cached if enabled)
  2. Execute the task and evaluators with configurable concurrency
  3. Upload results to Arize (unless in dry-run mode)
# Define your task
import openai

def answer_question(dataset_row):
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )

    return response.choices[0].message.content

# Define evaluators (optional)
from arize.experiments import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here"
    )
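
You can pass more than one evaluator; each receives the task output and the dataset row. As a hedged illustration, here is a second, hypothetical evaluator (is_concise is not part of the SDK; the 15-word threshold is arbitrary):

def is_concise(output, dataset_row):
    # Hypothetical second evaluator: flag answers longer than 15 words.
    word_count = len(output.split())
    concise = word_count <= 15
    return EvaluationResult(
        score=int(concise),
        label="concise" if concise else "verbose",
        explanation=f"Answer contains {word_count} words",
    )

Pass both functions when you run the experiment, e.g. evaluators=[is_correct, is_concise].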

# Run an experiment (client is your configured Arize client instance)
experiment, experiment_df = client.experiments.run(
    name="prompt-v2-experiment",
    dataset_id="dataset-id",
    task=answer_question,
    evaluators=[is_correct],
)

print(f"Experiment: {experiment}")
print(f"Results DataFrame shape: {experiment_df.shape}")

Dry Run Mode

Execute your experiment locally without logging results to Arize. Use this to test your task and evaluators before committing to a full run.
experiment, experiment_df = client.experiments.run(
    ...,
    dry_run=True,  # Test locally without logging
    dry_run_count=10,  # Only run on first 10 examples
)

# Note: experiment is None in dry-run mode
print(f"Results DataFrame shape: {experiment_df.shape}")

Concurrency Control

Control parallelism for faster execution.
experiment, experiment_df = client.experiments.run(
    ...,
    concurrency=10,  # Run 10 examples in parallel
)

Error Handling

Stop execution on the first error encountered.
experiment, experiment_df = client.experiments.run(
    ...,
    exit_on_error=True,  # Stop on first error
)
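
If you would rather keep the run going when individual examples fail, one option is to catch exceptions inside the task itself and return a sentinel output that your evaluators can score as incorrect; a minimal sketch (the sentinel prefix is arbitrary):

def answer_question_safely(dataset_row):
    try:
        return answer_question(dataset_row)
    except Exception as exc:
        # Return a sentinel output instead of letting the exception propagate;
        # evaluators can check for the prefix and mark the run as failed.
        return f"TASK_ERROR: {exc}"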

OpenTelemetry Tracing

Set the global OpenTelemetry tracer provider for the experiment run.
experiment, experiment_df = client.experiments.run(
    ...,
    set_global_tracer_provider=True,  # Enable global OTel tracing
)
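
Setting the global provider means any code that uses the standard OpenTelemetry API, such as a manual span inside your task, picks it up. A hedged sketch using only the opentelemetry-api package (the span name and retrieval step are illustrative):

from opentelemetry import trace

def answer_question_with_span(dataset_row):
    # get_tracer() resolves against the globally registered provider, so with
    # set_global_tracer_provider=True this span should appear in the experiment trace.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("retrieve-context"):
        pass  # e.g. a retrieval step you want timed separately
    return answer_question(dataset_row)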

List Experiments

List all experiments for a dataset.
resp = client.experiments.list(
    dataset_id="dataset-id",
    limit=50,
)

for experiment in resp.experiments:
    print(experiment.id, experiment.name)

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.
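
If you just want a quick tabular view without the response-object helpers, you can build a DataFrame by hand from the fields shown above (only id and name are assumed here):

import pandas as pd

# Manual conversion using only the id and name fields printed above.
experiments_df = pd.DataFrame(
    [{"id": e.id, "name": e.name} for e in resp.experiments]
)
print(experiments_df)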

Create an Experiment

Log pre-computed experiment results to Arize. Use this when you've already executed your experiment elsewhere and want to record the results. Unlike run(), this does not execute the task; it only logs existing results.
from arize.experiments import (
    ExperimentTaskResultFieldNames,
    EvaluationResultFieldNames,
)

experiment_runs = [
    {
        "example_id": "ex-1",
        "output": "Paris is the capital of France",
        "latency_ms": 245,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
    {
        "example_id": "ex-2",
        "output": "William Shakespeare wrote Romeo and Juliet",
        "latency_ms": 198,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
]

task_fields = ExperimentTaskResultFieldNames(
    example_id="example_id",
    output="output",
)

evaluator_columns = {
    "Correctness": EvaluationResultFieldNames(
        score="correctness_score",
        label="correctness_label",
    )
}

experiment = client.experiments.create(
    name="pre-computed-experiment",
    dataset_id="dataset-id",
    experiment_runs=experiment_runs,
    task_fields=task_fields,
    evaluator_columns=evaluator_columns,
)
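
If your pre-computed results already live in a pandas DataFrame, converting them into the list-of-dicts format above is straightforward; a sketch, assuming the column names match the field names referenced by task_fields and evaluator_columns:

import pandas as pd

# Stand-in for your own results DataFrame; columns must match the field
# names used above (example_id, output, correctness_score, correctness_label).
results_df = pd.DataFrame(experiment_runs)

# to_dict(orient="records") produces the same list-of-dicts structure
# that experiment_runs uses above.
experiment = client.experiments.create(
    name="pre-computed-experiment",
    dataset_id="dataset-id",
    experiment_runs=results_df.to_dict(orient="records"),
    task_fields=task_fields,
    evaluator_columns=evaluator_columns,
)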

Get an Experiment

Retrieve experiment details and metadata.
experiment = client.experiments.get(experiment_id="experiment-id")

print(experiment)

Delete an Experiment

Delete an experiment by ID. This operation is irreversible, and the call returns no response.
client.experiments.delete(experiment_id="experiment-id")

print("Experiment deleted successfully")

List Experiment Runs

Retrieve individual runs from an experiment with pagination support.
resp = client.experiments.list_runs(
    experiment_id="experiment-id",
    limit=100,
)

for run in resp.experiment_runs:
    print(run)

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Learn more: Experiments Documentation