# Experiments

> Track and evaluate changes to prompts, models, and retrieval strategies. Run experiments with automatic tracing and concurrent evaluation.

<Note>
  With the exception of `run`, which is **stable**, the `experiments` client methods are currently in **BETA** and their API may change without notice. A one-time warning is emitted on first use.
</Note>

## Key Capabilities

* Automatic tracing of all LLM calls during experiments
* Concurrent execution for faster evaluation
* Dry-run mode for testing without logging
* Built-in evaluator support
* Compare experiments side-by-side in the UI

## List Experiments

List all experiments, optionally filtered by dataset or space.

```python theme={null}
resp = client.experiments.list(
    dataset="dataset-name-or-id",   # optional
    space="your-space-name-or-id",  # optional
    limit=50,
)

for experiment in resp.experiments:
    print(experiment.id, experiment.name)
```

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see [Response Objects](/docs/api-clients/python/version-8/overview#response-objects).

## Create an Experiment

Log pre-computed experiment results to Arize. Use this when you have already executed your experiment elsewhere and want to record the results. Unlike `run()`, this method does not execute the task; it only logs existing results.

<CodeGroup>
  ```python From Dictionaries theme={null}
  from arize.experiments import (
      ExperimentTaskFieldNames,
      EvaluationResultFieldNames,
  )

  experiment_runs = [
      {
          "example_id": "ex-1",
          "output": "Paris is the capital of France",
          "latency_ms": 245,
          "correctness_score": 1.0,
          "correctness_label": "correct",
      },
      {
          "example_id": "ex-2",
          "output": "William Shakespeare wrote Romeo and Juliet",
          "latency_ms": 198,
          "correctness_score": 1.0,
          "correctness_label": "correct",
      },
  ]

  task_fields = ExperimentTaskFieldNames(
      example_id="example_id",
      output="output",
  )

  evaluator_columns = {
      "Correctness": EvaluationResultFieldNames(
          score="correctness_score",
          label="correctness_label",
      )
  }

  experiment = client.experiments.create(
      name="pre-computed-experiment",
      dataset="dataset-name-or-id",
      experiment_runs=experiment_runs,
      task_fields=task_fields,
      evaluator_columns=evaluator_columns,
  )
  ```

  ```python From DataFrame theme={null}
  import pandas as pd
  from arize.experiments import (
      ExperimentTaskFieldNames,
      EvaluationResultFieldNames,
  )

  experiment_runs = pd.DataFrame({
      "example_id": ["ex-1", "ex-2"],
      "output": ["Paris is the capital of France", "William Shakespeare wrote Romeo and Juliet"],
      "latency_ms": [245, 198],
      "correctness_score": [1.0, 1.0],
      "correctness_label": ["correct", "correct"],
  })

  task_fields = ExperimentTaskFieldNames(
      example_id="example_id",
      output="output",
  )

  evaluator_columns = {
      "Correctness": EvaluationResultFieldNames(
          score="correctness_score",
          label="correctness_label",
      )
  }

  experiment = client.experiments.create(
      name="pre-computed-experiment",
      dataset="dataset-name-or-id",
      experiment_runs=experiment_runs,
      task_fields=task_fields,
      evaluator_columns=evaluator_columns,
  )
  ```
</CodeGroup>
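Because `task_fields` and `evaluator_columns` refer to columns by name, it can help to check locally that every run actually contains those columns before logging. The following is a minimal sketch in plain Python; `missing_columns` is a hypothetical helper, not part of the Arize SDK, and whether the SDK performs its own validation is not covered here.

```python
from typing import Iterable, Mapping


def missing_columns(
    runs: Iterable[Mapping[str, object]],
    required: Iterable[str],
) -> dict[int, list[str]]:
    """Return {row_index: [missing column names]} for rows lacking required keys."""
    required = list(required)
    problems: dict[int, list[str]] = {}
    for index, run in enumerate(runs):
        absent = [name for name in required if name not in run]
        if absent:
            problems[index] = absent
    return problems


experiment_runs = [
    {
        "example_id": "ex-1",
        "output": "Paris is the capital of France",
        "correctness_score": 1.0,
        "correctness_label": "correct",
    },
    {
        "example_id": "ex-2",
        "output": "William Shakespeare wrote Romeo and Juliet",
        "correctness_score": 1.0,
        # "correctness_label" is deliberately missing from this row
    },
]

# Column names referenced by task_fields and evaluator_columns in the example above
required_columns = ["example_id", "output", "correctness_score", "correctness_label"]

problems = missing_columns(experiment_runs, required_columns)
print(problems)  # {1: ['correctness_label']}
```

An empty result means every row carries the mapped columns, so the mapping and the data agree.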

## Get an Experiment

Retrieve experiment details and metadata by name or ID. When using a name, provide `dataset` and optionally `space` to disambiguate.

```python theme={null}
experiment = client.experiments.get(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
)

print(experiment)
```

## Delete an Experiment

Delete an experiment by name or ID. This operation is irreversible, and the call returns no value.

```python theme={null}
client.experiments.delete(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
)

print("Experiment deleted successfully")
```

## Run an Experiment

Execute a task function across your dataset examples with automatic evaluation, then log the results to Arize.

**High-level flow:**

1. Resolve the dataset and download examples (cached if enabled)
2. Execute the task and evaluators with configurable concurrency
3. Upload results to Arize (unless in dry-run mode)

```python theme={null}
# Define your task
import openai

def answer_question(dataset_row):
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )

    return response.choices[0].message.content

# Define evaluators (optional)
from arize.experiments import EvaluationResult

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    # Guard against a missing expected value or an empty model output
    correct = bool(expected) and expected in (output or "")
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here",
    )

# Run an experiment
experiment, experiment_df = client.experiments.run(
    name="prompt-v2-experiment",
    dataset="dataset-name-or-id",
    task=answer_question,
    evaluators=[is_correct],
)

print(f"Experiment: {experiment}")
print(f"Results DataFrame shape: {experiment_df.shape}")
```
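The three-step flow described above can be pictured with a small local simulation. This is an illustrative sketch only, not the SDK's implementation: `run_locally`, the `upload` callback, the `"id"` key, and the plain-dict evaluator result (standing in for `EvaluationResult`) are all assumptions made for the example, and the task is a stub that needs no API key.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def run_locally(
    rows: list[Row],
    task: Callable[[Row], Any],
    evaluators: list[Callable[[Any, Row], dict]],
    upload: Callable[[list[dict]], None],
    dry_run: bool = False,
    dry_run_count: Optional[int] = None,
) -> list[dict]:
    # Step 1: pick the examples (a dry run may use only the first few)
    selected = rows[:dry_run_count] if dry_run and dry_run_count is not None else rows

    # Step 2: run the task, then every evaluator, on each example
    results = []
    for row in selected:
        output = task(row)
        record = {"example_id": row["id"], "output": output}
        for evaluator in evaluators:
            record[evaluator.__name__] = evaluator(output, row)
        results.append(record)

    # Step 3: upload unless this is a dry run
    if not dry_run:
        upload(results)
    return results


INVENTORS = {"Telephone": "Alexander Graham Bell"}


def stub_task(row: Row) -> str:
    # Stands in for an LLM call
    return INVENTORS.get(row["attributes.input.value"], "unknown")


def is_correct(output: str, row: Row) -> dict:
    correct = row["attributes.output.value"] in output
    return {"score": int(correct), "label": "correct" if correct else "incorrect"}


rows = [
    {"id": "ex-1", "attributes.input.value": "Telephone", "attributes.output.value": "Bell"},
    {"id": "ex-2", "attributes.input.value": "Radio", "attributes.output.value": "Marconi"},
]

uploaded: list[dict] = []
dry_results = run_locally(
    rows, stub_task, [is_correct], upload=uploaded.extend, dry_run=True, dry_run_count=1
)
print(len(dry_results), len(uploaded))  # 1 0

full_results = run_locally(rows, stub_task, [is_correct], upload=uploaded.extend)
print(len(full_results), len(uploaded))  # 2 2
```

Exercising the task and evaluators against a stub like this is a cheap way to catch key-name mistakes before spending model calls.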

### Dry Run Mode

Execute your experiment locally without logging results to Arize. Use this to test your task and evaluators before committing to a full run.

```python theme={null}
experiment, experiment_df = client.experiments.run(
    ...,
    dry_run=True,  # Test locally without logging
    dry_run_count=10,  # Only run on first 10 examples
)

# Note: experiment is None in dry-run mode
print(f"Results DataFrame shape: {experiment_df.shape}")
```

### Concurrency Control

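Set `concurrency` to control how many examples are processed in parallel. Higher values shorten wall-clock time for I/O-bound tasks such as LLM calls.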

```python theme={null}
experiment, experiment_df = client.experiments.run(
    ...,
    concurrency=10,  # Run 10 examples in parallel
)
```
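To see why parallelism helps for I/O-bound tasks, the sketch below runs a simulated slow task with a bounded thread pool. It illustrates the general idea only; how the SDK schedules work internally is not stated in this page, and `run_with_concurrency` and `slow_task` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(row: dict) -> str:
    time.sleep(0.05)  # stands in for network latency of an LLM call
    return row["question"].upper()


def run_with_concurrency(rows: list, task, concurrency: int) -> list:
    # At most `concurrency` tasks run at once; map() returns results in input order
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(task, rows))


rows = [{"question": f"q{i}"} for i in range(10)]

start = time.perf_counter()
sequential = run_with_concurrency(rows, slow_task, concurrency=1)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = run_with_concurrency(rows, slow_task, concurrency=10)
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
print(parallel == sequential)  # True: same results, same order
```

When the task calls a hosted model, the practical ceiling is often the provider's rate limit rather than local resources, so it may be worth raising the value gradually.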

### Error Handling

Set `exit_on_error=True` to stop execution on the first error encountered.

```python theme={null}
experiment, experiment_df = client.experiments.run(
    ...,
    exit_on_error=True,  # Stop on first error
)
```
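The difference between the two policies can be sketched locally. The code below is an illustration under assumptions, not the SDK's behavior: in particular, recording the error and continuing when the flag is off is an assumption of this example, and `run_tasks` and `flaky_task` are hypothetical names.

```python
def run_tasks(rows, task, exit_on_error=False):
    results = []
    for row in rows:
        try:
            results.append({"input": row, "output": task(row), "error": None})
        except Exception as exc:
            if exit_on_error:
                raise  # fail fast: abandon the remaining rows
            # Assumed alternative: record the failure and keep going
            results.append({"input": row, "output": None, "error": repr(exc)})
    return results


def flaky_task(value):
    if value == 2:
        raise ValueError("bad input")
    return value * 10


# Continue past the failure and inspect it afterwards
results = run_tasks([1, 2, 3], flaky_task)
print([r["output"] for r in results])  # [10, None, 30]
print(results[1]["error"])  # ValueError('bad input')

# Fail fast on the first error
try:
    run_tasks([1, 2, 3], flaky_task, exit_on_error=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Failing fast suits early debugging, since a broken task or evaluator surfaces immediately instead of after the whole dataset has been processed.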

### OpenTelemetry Tracing

Set the global OpenTelemetry tracer provider for the experiment run.

```python theme={null}
experiment, experiment_df = client.experiments.run(
    ...,
    set_global_tracer_provider=True,  # Enable global OTel tracing
)
```

## List Experiment Runs

Retrieve individual runs from an experiment with pagination support. Pass `all=True` to fetch all runs via Flight (ignores `limit`).

```python theme={null}
resp = client.experiments.list_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
    limit=100,
)

for run in resp.experiment_runs:
    print(run)
```

For details on pagination, field introspection, and data conversion (to dict/JSON/DataFrame), see [Response Objects](/docs/api-clients/python/version-8/overview#response-objects).

## Append Experiment Runs

Append new runs to an existing experiment. Runs are inserted in input order. Provide between 1 and 1000 runs per request. Each run must include `example_id` (an existing dataset example) and `output`; additional user-defined fields (e.g. `latency_ms`, `model`) are allowed.

```python theme={null}
new_runs = [
    {
        "example_id": "ex-3",
        "output": "Marie Curie won two Nobel Prizes",
        "latency_ms": 312,
    },
]

result = client.experiments.append_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # required when using a name
    experiment_runs=new_runs,
)

print(result.run_ids)
```
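Since each request accepts between 1 and 1000 runs, a larger set of runs has to be split before appending. A minimal sketch of that batching, assuming the runs are already in memory as a list of dictionaries; `batched` is a hypothetical helper, and the `append_runs` call is left as a comment because it needs a configured client.

```python
from typing import Iterator

MAX_RUNS_PER_REQUEST = 1000  # per-request limit stated above


def batched(runs: list[dict], size: int = MAX_RUNS_PER_REQUEST) -> Iterator[list[dict]]:
    """Yield consecutive slices of `runs`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(runs), size):
        yield runs[start:start + size]


all_runs = [{"example_id": f"ex-{i}", "output": f"answer {i}"} for i in range(2500)]
batches = list(batched(all_runs))
print([len(batch) for batch in batches])  # [1000, 1000, 500]

# Each batch could then be appended as in the example above:
# for batch in batches:
#     client.experiments.append_runs(
#         experiment="experiment-name-or-id",
#         dataset="dataset-name-or-id",
#         experiment_runs=batch,
#     )
```

Slicing in order keeps the runs in their original sequence across requests, and an empty input produces no batches, so no request with zero runs is attempted.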

## Annotate Experiment Runs

Write human annotations to a batch of runs in an experiment. Annotations are upserted by annotation config name for each run; submitting the same name for the same run overwrites the previous value. Up to 1000 runs may be annotated per request. This method returns `None` on success.

```python theme={null}
from arize.experiments.types import AnnotateRecordInput, AnnotationInput

client.experiments.annotate_runs(
    experiment="experiment-name-or-id",
    dataset="dataset-name-or-id",  # optional, used to resolve experiment by name
    space="your-space-name-or-id",  # optional, used to resolve dataset by name
    annotations=[
        AnnotateRecordInput(
            record_id="your-run-id",
            values=[
                AnnotationInput(name="accuracy", label="correct", score=1.0),
                AnnotationInput(name="notes", text="Well-structured output"),
            ],
        ),
    ],
)
```
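The upsert semantics described above can be modeled with a nested dictionary keyed by run and then by annotation name. This is a conceptual sketch only; `upsert_annotations` and the dictionary shapes are assumptions for illustration and do not reflect how Arize stores annotations.

```python
def upsert_annotations(store: dict, run_id: str, values: list) -> dict:
    """Insert or overwrite annotations for one run, keyed by annotation name."""
    run_annotations = store.setdefault(run_id, {})
    for value in values:
        # Same name on the same run replaces the earlier value
        run_annotations[value["name"]] = value
    return store


store: dict = {}

upsert_annotations(
    store,
    "run-1",
    [
        {"name": "accuracy", "label": "incorrect", "score": 0.0},
        {"name": "notes", "text": "First pass"},
    ],
)

# A later submission with the same name overwrites only that annotation
upsert_annotations(store, "run-1", [{"name": "accuracy", "label": "correct", "score": 1.0}])

print(store["run-1"]["accuracy"]["label"])  # correct
print(store["run-1"]["notes"]["text"])  # First pass
```

Under this model, resubmitting a corrected annotation does not require resending the run's other annotations.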

**Learn more:** [Experiments Documentation](https://arize.com/docs/ax/develop/datasets-and-experiments)