Track and evaluate changes to prompts, models, and retrieval strategies. Run experiments with automatic tracing and evaluation.

Key Capabilities

  • Automatic tracing of all LLM calls during experiments
  • Concurrent execution for faster evaluation
  • Dry-run mode for testing without logging
  • Built-in evaluator support
  • Compare experiments side-by-side in the UI

Run an Experiment

Execute a task function across your dataset examples with automatic evaluation, then log the results to Arize. High-level flow:
  1. Resolve the dataset and download examples (cached if enabled)
  2. Execute the task and evaluators with configurable concurrency
  3. Upload results to Arize (unless in dry-run mode)
# Define your task
import openai

def answer_question(dataset_row):
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )

    return response.choices[0].message.content

# Define evaluators (optional)
from arize.experiments import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here"
    )
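
You can pass more than one evaluator; each receives the task output and the dataset row. As a hedged illustration, here is a second, hypothetical evaluator (is_concise is not part of the SDK; the 15-word threshold is arbitrary):

def is_concise(output, dataset_row):
    # Hypothetical second evaluator: flag answers longer than 15 words.
    word_count = len(output.split())
    concise = word_count <= 15
    return EvaluationResult(
        score=int(concise),
        label="concise" if concise else "verbose",
        explanation=f"Answer contains {word_count} words",
    )

Pass both functions when you run the experiment, e.g. evaluators=[is_correct, is_concise].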

# Run an experiment (client is your configured Arize client instance)
experiment, experiment_df = client.experiments.run(
    name="prompt-v2-experiment",
    dataset_id="dataset-id",
    task=answer_question,
    evaluators=[is_correct],
)

print(f"Experiment: {experiment}")
print(f"Results DataFrame shape: {experiment_df.shape}")

Dry Run Mode

Execute your experiment locally without logging results to Arize. Use this to test your task and evaluators before committing to a full run.
experiment, experiment_df = client.experiments.run(
    ...,
    dry_run=True,  # Test locally without logging
    dry_run_count=10,  # Only run on first 10 examples
)

# Note: experiment is None in dry-run mode
print(f"Results DataFrame shape: {experiment_df.shape}")

Concurrency Control

Control parallelism for faster execution.
experiment, experiment_df = client.experiments.run(
    ...,
    concurrency=10,  # Run 10 examples in parallel
)

Error Handling

Stop execution on the first error encountered.
experiment, experiment_df = client.experiments.run(
    ...,
    exit_on_error=True,  # Stop on first error
)
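
If you would rather keep the run going when individual examples fail, one option is to catch exceptions inside the task itself and return a sentinel output that your evaluators can score as incorrect; a minimal sketch (the sentinel prefix is arbitrary):

def answer_question_safely(dataset_row):
    try:
        return answer_question(dataset_row)
    except Exception as exc:
        # Return a sentinel output instead of letting the exception propagate;
        # evaluators can check for the prefix and mark the run as failed.
        return f"TASK_ERROR: {exc}"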

OpenTelemetry Tracing

Set the global OpenTelemetry tracer provider for the experiment run.
experiment, experiment_df = client.experiments.run(
    ...,
    set_global_tracer_provider=True,  # Enable global OTel tracing
)
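
Setting the global provider means any code that uses the standard OpenTelemetry API, such as a manual span inside your task, picks it up. A hedged sketch using only the opentelemetry-api package (the span name and retrieval step are illustrative):

from opentelemetry import trace

def answer_question_with_span(dataset_row):
    # get_tracer() resolves against the globally registered provider, so with
    # set_global_tracer_provider=True this span should appear in the experiment trace.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("retrieve-context"):
        pass  # e.g. a retrieval step you want timed separately
    return answer_question(dataset_row)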

List Experiments

List all experiments for a dataset.
resp = client.experiments.list(
    dataset_id="dataset-id",
    limit=50,
)

for experiment in resp.experiments:
    print(experiment.id, experiment.name)

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.
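
If you just want a quick tabular view without the response-object helpers, you can build a DataFrame by hand from the fields shown above (only id and name are assumed here):

import pandas as pd

# Manual conversion using only the id and name fields printed above.
experiments_df = pd.DataFrame(
    [{"id": e.id, "name": e.name} for e in resp.experiments]
)
print(experiments_df)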

Create an Experiment

Log pre-computed experiment results to Arize. Use this when you've already executed your experiment elsewhere and want to record the results. Unlike run(), this does not execute the task; it only logs existing results.
from arize.experiments import (
    ExperimentTaskResultFieldNames,
    EvaluationResultFieldNames,
)

experiment_runs = [
    {
        "example_id": "ex-1",
        "output": "Paris is the capital of France",
        "latency_ms": 245,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
    {
        "example_id": "ex-2",
        "output": "William Shakespeare wrote Romeo and Juliet",
        "latency_ms": 198,
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
]

task_fields = ExperimentTaskResultFieldNames(
    example_id="example_id",
    output="output",
)

evaluator_columns = {
    "Correctness": EvaluationResultFieldNames(
        score="correctness_score",
        label="correctness_label",
    )
}

experiment = client.experiments.create(
    name="pre-computed-experiment",
    dataset_id="dataset-id",
    experiment_runs=experiment_runs,
    task_fields=task_fields,
    evaluator_columns=evaluator_columns,
)
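
If your pre-computed results already live in a pandas DataFrame, converting them into the list-of-dicts format above is straightforward; a sketch, assuming the column names match the field names referenced by task_fields and evaluator_columns:

import pandas as pd

# Stand-in for your own results DataFrame; columns must match the field
# names used above (example_id, output, correctness_score, correctness_label).
results_df = pd.DataFrame(experiment_runs)

# to_dict(orient="records") produces the same list-of-dicts structure
# that experiment_runs uses above.
experiment = client.experiments.create(
    name="pre-computed-experiment",
    dataset_id="dataset-id",
    experiment_runs=results_df.to_dict(orient="records"),
    task_fields=task_fields,
    evaluator_columns=evaluator_columns,
)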

Get an Experiment

Retrieve experiment details and metadata.
experiment = client.experiments.get(experiment_id="experiment-id")

print(experiment)

Delete an Experiment

Delete an experiment by ID. This operation is irreversible, and the call returns no response.
client.experiments.delete(experiment_id="experiment-id")

print("Experiment deleted successfully")

List Experiment Runs

Retrieve individual runs from an experiment with pagination support.
resp = client.experiments.list_runs(
    experiment_id="experiment-id",
    limit=100,
)

for run in resp.experiment_runs:
    print(run)

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see Response Objects.

Learn more: Experiments Documentation