# Experiment in code

> Run experiments from your own runtime for pipelines, agents, sandboxes, and multi-step workflows, then score and compare each run in Arize AX

Use code when the experiment lives outside a single prompt call: a pipeline, agent, sandbox, CI job, or any runtime you control. You can either let the Python client run the task against the dataset for you, or upload results you already produced elsewhere.

## Two approaches

The rest of the page follows these two paths:

|                             | Run an experiment                    | Log an experiment                                     |
| --------------------------- | ------------------------------------ | ----------------------------------------------------- |
| Who executes the task       | Your Python environment              | Your code, anywhere                                   |
| Upload interface            | Python SDK                           | Python SDK, TypeScript/JS SDK, CLI (`ax`), REST API   |
| Concurrency, tracing, evals | Handled by the client                | You handle execution and any scores                   |
| Best for                    | Self-contained Python task functions | Pipelines, agents, sandboxes, CI, and remote runtimes |

Both paths land in the same experiments UI, so you can compare runs side by side after they finish.

## Run an experiment

Use this when the entire task fits inside a single Python function and you want the Python client to orchestrate the run. The client resolves the dataset, runs the task against every row, scores the outputs with the evaluators you pass, and logs the results.

### Define a task

A **task function** is the unit of work you want to measure: a single LLM call, a retrieval pipeline, an agent workflow, or any application logic. This is where you decide what stays fixed and what changes between runs.

If you still need to create or refine the dataset first, use [Build a dataset](/ax/improve/build-a-dataset).

```python Python theme={null}
import openai

def answer_question(dataset_row) -> str:
    question = dataset_row.get("attributes.input.value", "")
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""
```

Task functions receive the full `dataset_row`. Read whatever columns your dataset provides, whether they are OpenInference attributes such as `attributes.input.value` or custom columns such as `input`, `question`, `tickers`, or `focus`.
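As an illustration of reading custom columns, the sketch below builds a prompt from the `tickers` and `focus` columns mentioned above and falls back to `attributes.input.value` when they are absent. `build_prompt` is a hypothetical helper, not part of the Arize client, and the rows are plain dicts standing in for `dataset_row`.

```python
def build_prompt(dataset_row) -> str:
    """Build a prompt from whichever columns the dataset row provides."""
    # Example custom columns; rename them to match your dataset.
    tickers = dataset_row.get("tickers")
    focus = dataset_row.get("focus")
    if tickers and focus:
        return f"Research {tickers} with a focus on {focus}."
    # Fall back to the OpenInference input attribute.
    return dataset_row.get("attributes.input.value", "")


custom_row = {"tickers": "AAPL, MSFT", "focus": "cash flow"}
openinference_row = {"attributes.input.value": "Who invented the telephone?"}

print(build_prompt(custom_row))
print(build_prompt(openinference_row))
```

A task function can call a helper like this first and then send the resulting prompt to the model.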

To version the task's prompt in Prompts instead of hard-coding it, fetch it inside your task with [`client.prompts.get()`](/api-clients/python/version-8/client-resources/prompts#get-a-prompt).
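One way to make explicit what stays fixed and what changes between runs is to build the task from a small factory. This is a sketch, not a client feature: `make_task` and the `call_llm` callable are hypothetical names, and `fake_llm` stands in for a real model call so the example runs offline.

```python
from typing import Callable


def make_task(call_llm: Callable[[str, str], str], model: str, template: str):
    """Return a task function with the model and prompt template fixed."""

    def task(dataset_row) -> str:
        question = dataset_row.get("attributes.input.value", "")
        return call_llm(model, template.format(question=question))

    return task


def fake_llm(model: str, prompt: str) -> str:
    # Stand-in for a real model call; echoes its inputs.
    return f"[{model}] {prompt}"


# Same model in both variants; only the prompt template changes.
baseline_task = make_task(fake_llm, "model-a", "Answer briefly: {question}")
candidate_task = make_task(fake_llm, "model-a", "Answer step by step: {question}")

row = {"attributes.input.value": "What is 2 + 2?"}
print(baseline_task(row))
print(candidate_task(row))
```

Each returned function takes a single `dataset_row`, so it has the same shape as the task function shown above.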

### Add an evaluator

Use a code evaluator when you have deterministic rules or ground truth. Use an LLM judge when the criterion is more subjective. A clear starting point is a function that returns an `EvaluationResult`:

```python Python theme={null}
from arize.experiments import EvaluationResult

def correctness(output, dataset_row):
    expected = dataset_row.get("attributes.output.value", "")
    generated = output or ""
    correct = bool(expected) and expected.lower() in generated.lower()
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation=f"Expected '{expected}', got '{generated[:80]}...'",
    )
```

Keep the first evaluator simple. For deeper evaluator workflows, see [Evaluate](/ax/evaluate/create-evaluators).

### Run it

Pass your task and evaluators to the client, and point the `dataset=` parameter at the dataset by ID or name.

Continuing from the task and evaluator above:

```python Python theme={null}
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

experiment, experiment_df = client.experiments.run(
    name="support-agent-baseline",
    dataset="YOUR_DATASET_ID_OR_NAME",
    task=answer_question,
    evaluators=[correctness],
    concurrency=10,
)
```

`client.experiments.run()` resolves the dataset, runs the task and evaluators in your Python environment, and logs the results to Arize AX. Name runs consistently so the comparison view diffs cleanly; [Plan a baseline](/ax/improve/set-up-an-experiment#plan-a-baseline) covers the naming pattern itself.

For the full parameter list, see the Python [experiments API reference](/api-clients/python/version-8/client-resources/experiments#run-an-experiment). If you're migrating older experiment code, see the [experiments migration guide](/api-clients/python/version-8/migration/experiments-client).

<Info>
  Use `dry_run=True` to test the loop locally without logging results. In dry-run mode, `client.experiments.run()` returns `(None, experiment_df)`.
</Info>

Once either path has produced runs, the rest of the workflow is the same.

## Log an experiment

Choose this path when the task already runs in its own environment: another service, a TypeScript app, a sandbox, CI, or a notebook with its own orchestration. Compute the outputs yourself, key each row by `example_id`, and upload the finished results.

### Assemble your results

Put your task outputs in a table such as a DataFrame, CSV, JSON, JSONL, or Parquet file. Every row must carry:

* `example_id`, the dataset row ID the result corresponds to
* `output` (or another column you map to it), the task output for that example

If you ran evaluators yourself, include their label and score fields alongside each row. You can map the exact column names when you upload and give that evaluator a name such as `correctness`, or attach evaluators later if you do not have them yet.
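Before uploading, it can help to check the rows locally. The sketch below is a plain-Python check that assumes results are a list of dicts. `validate_runs` is a hypothetical helper, and the duplicate check is a local sanity check for the one-row-per-example shape, not a documented server-side rule.

```python
from collections import Counter


def validate_runs(rows, output_field="output"):
    """Return a list of problems found in experiment result rows."""
    problems = []
    for index, row in enumerate(rows):
        if not row.get("example_id"):
            problems.append(f"row {index}: missing example_id")
        if output_field not in row:
            problems.append(f"row {index}: missing {output_field}")
    # Flag example IDs that appear on more than one row.
    counts = Counter(row["example_id"] for row in rows if row.get("example_id"))
    for example_id, count in counts.items():
        if count > 1:
            problems.append(f"example_id {example_id}: {count} rows")
    return problems


rows = [
    {"example_id": "ex_1", "result": "Alexander Graham Bell"},
    {"example_id": "ex_1", "result": "Bell"},
    {"result": "No ID on this row"},
]
print(validate_runs(rows, output_field="result"))
```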

If your results already contain example IDs, keep them. If they do not, fetch the dataset examples first and map `id` to `example_id` before you upload the run. When you resolve a dataset by name while fetching examples, pass `space=` to `client.datasets.list_examples()`. Your external job should produce one result row for each dataset example it ran; the example below shows the required shape and maps each row to an existing dataset example ID.

```python Python theme={null}
import pandas as pd
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

dataset_examples = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID",
    all=True,
).to_df()

experiment_run_df = pd.DataFrame(
    {
        "result": ["The telephone was invented by Alexander Graham Bell."],
        "label": ["correct"],
        "score": [1],
    }
)

# Match each result row to the dataset example it came from.
experiment_run_df["example_id"] = dataset_examples["id"].head(
    len(experiment_run_df)
).to_list()
```
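The positional match above assumes results come back in dataset order. When they do not, joining on a shared column is safer. The sketch below uses plain dicts and assumes a hypothetical `question` column present in both the dataset examples and the results; it ends by serializing the rows to CSV text with `example_id` and `output` columns.

```python
import csv
import io

# Stand-ins for the fetched dataset examples and your job's results.
dataset_examples = [
    {"id": "ex_1", "question": "Who invented the telephone?"},
    {"id": "ex_2", "question": "Who wrote Hamlet?"},
]
results = [
    {"question": "Who wrote Hamlet?", "output": "William Shakespeare"},
    {"question": "Who invented the telephone?", "output": "Alexander Graham Bell"},
]

# Look up each result's dataset example ID by the shared key.
id_by_question = {ex["question"]: ex["id"] for ex in dataset_examples}
runs = [
    {"example_id": id_by_question[r["question"]], "output": r["output"]}
    for r in results
]

# Serialize to CSV text; write it to a file for a file-based upload.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["example_id", "output"])
writer.writeheader()
writer.writerows(runs)
print(buffer.getvalue())
```

The dictionary lookup raises `KeyError` for a result with no matching example, which surfaces unmatched rows before upload instead of silently dropping them.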

### Upload the run

<Tabs>
  <Tab title="By Arize Skills">
    Use the [Arize skills plugin](/ax/set-up-with-ai-assistants) with the [`arize-experiment`](https://github.com/Arize-ai/arize-skills/tree/main/skills/arize-experiment) skill to upload a results file directly. The file must have `example_id` and `output` columns (CSV, JSON, JSONL, or Parquet). See [`ax experiments create`](/api-clients/cli/experiments#ax-experiments-create) for the full schema.

    Try asking your agent:

    * "Upload `runs.csv` as a new experiment on dataset `ds_xxx` and name it `baseline-v1`."
    * "Create an experiment from `nightly_runs.jsonl` for dataset `qa-regression`."

    <Frame caption="Upload experiment results from your coding agent with the arize-experiment skill">
      <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/arize-docs-images/improve/experiment_run_skill.png" alt="Coding agent using the arize-experiment skill to upload experiment runs and create a new experiment" />
    </Frame>
  </Tab>

  <Tab title="By Code">
    The Python SDK is the most explicit option when you need to map custom task or evaluator columns. The supported `create()` contract is the results table plus `task_fields` and, if you have them, `evaluator_columns`. TypeScript/JS, CLI, and REST are also supported upload interfaces for remote runs.

    The Python example below continues from the `experiment_run_df` assembled above and defines the field mappings before upload.

    <CodeGroup>
      ```python Python theme={null}
      from arize import ArizeClient
      from arize.experiments import ExperimentTaskFieldNames, EvaluationResultFieldNames

      client = ArizeClient(api_key="YOUR_API_KEY")

      task_fields = ExperimentTaskFieldNames(example_id="example_id", output="result")
      evaluator_fields = EvaluationResultFieldNames(
          label="label",
          score="score",
      )

      experiment = client.experiments.create(
          name="my_experiment",
          dataset="YOUR_DATASET_ID",
          experiment_runs=experiment_run_df,
          task_fields=task_fields,
          # `correctness` is the evaluator name shown in Arize AX.
          evaluator_columns={"correctness": evaluator_fields},
      )
      ```

      ```typescript TS/JS theme={null}
      import { createExperiment } from "@arizeai/ax-client";

      const experimentRuns = [
        { exampleId: "ex_1", output: "..." },
        { exampleId: "ex_2", output: "..." },
      ];

      await createExperiment({
        experimentName: "my_experiment",
        dataset: "YOUR_DATASET_ID",  // dataset name or ID (pass `space` when using a name)
        experimentRuns,
      });
      ```
    </CodeGroup>

    * **Python reference:** [`client.experiments.create()`](/api-clients/python/version-8/client-resources/experiments#create-an-experiment)
    * **TypeScript reference:** [`createExperiment()`](/api-clients/typescript/version-1/client-resources/experiments#create-an-experiment)
    * **CLI reference:** [`ax experiments create`](/api-clients/cli/experiments#ax-experiments-create)
    * **REST overview:** [Arize REST API](/ax/rest-reference/overview)

    If you're migrating older Python experiment-logging code, see the [experiments migration guide](/api-clients/python/version-8/migration/experiments-client#log_experiment).
  </Tab>
</Tabs>

Remote experiments fit pipelines you already own. The task stays in your runtime, and Arize AX records and compares the resulting runs.

### Evaluate a remote experiment

You have two options for attaching scores to a remote run:

1. **Score outputs yourself and upload alongside results.** Add evaluator columns such as `label` and `score` to the results DataFrame and map them through `EvaluationResultFieldNames`, as shown in the Python upload example above. Use this when the evaluator logic lives in the same environment as your task.
2. **Attach evaluators in Arize AX after upload.** Upload the experiment without eval columns, then open the experiment results page and click **Add Evaluator**, run the [`arize-evaluator`](https://github.com/Arize-ai/arize-skills/tree/main/skills/arize-evaluator) skill, or run an existing evaluator from the experiment workflow. Use this when you want an LLM-as-a-judge scored from Arize AX itself, especially across remote runs from multiple languages.

Try asking your agent:

* "Create a correctness LLM-as-a-judge evaluator using my OpenAI integration and run it on experiment `exp_xxx`."
* "Score every run in experiment `exp_xxx` with a groundedness judge."

The judge LLM needs stored credentials. Use the [`arize-ai-provider-integration`](https://github.com/Arize-ai/arize-skills/tree/main/skills/arize-ai-provider-integration) skill to set up your OpenAI or Anthropic keys, or your Bedrock role.
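If you take the first option and score outputs yourself, the scoring step can be as small as the sketch below. It assumes plain-dict result rows and a hypothetical `expected_by_id` lookup of ground truth; `score_row` is an illustrative helper that adds the `label` and `score` fields you would later map at upload.

```python
def score_row(row, expected):
    """Attach label and score fields to one result row."""
    correct = bool(expected) and expected.lower() in row["result"].lower()
    return {
        **row,
        "label": "correct" if correct else "incorrect",
        "score": int(correct),
    }


# Hypothetical ground truth keyed by dataset example ID.
expected_by_id = {"ex_1": "Alexander Graham Bell", "ex_2": "William Shakespeare"}
rows = [
    {"example_id": "ex_1", "result": "The telephone was invented by Alexander Graham Bell."},
    {"example_id": "ex_2", "result": "Christopher Marlowe"},
]

scored_rows = [score_row(row, expected_by_id.get(row["example_id"], "")) for row in rows]
for row in scored_rows:
    print(row["example_id"], row["label"], row["score"])
```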

For the broader evaluator workflow:

* [Create evaluators](/ax/evaluate/create-evaluators): create or manage reusable evaluators.
* [Run offline evals on experiments](/ax/evaluate/run-evals-on-experiments): run evaluators against an existing experiment.
* [Human review](/ax/evaluate/human-review) and [Labeling queues](/ax/evaluate/labeling-queues): collect labels before you automate.

## Manage your experiments

### Compare experiments

Whether you logged the run or had the Python client run it, compare the results in the same experiments UI. For the full walkthrough, see [Compare experiments](/ax/improve/experiment-in-playground#compare-experiments).

### Export or get results

If you want experiment metadata and runs programmatically, pull them in code:

<CodeGroup>
  ```python Python theme={null}
  from arize import ArizeClient

  client = ArizeClient(api_key="YOUR_API_KEY")

  experiment = client.experiments.get(
      experiment="support-agent-baseline",
      dataset="YOUR_DATASET_ID",
  )

  runs_df = client.experiments.list_runs(
      experiment=experiment.id,
      all=True,
  ).to_df()
  ```

  ```typescript TS/JS theme={null}
  import { getExperiment, listExperimentRuns } from "@arizeai/ax-client";

  const experiment = await getExperiment({
    experiment: "your_experiment_id",
  });

  const experimentRuns = await listExperimentRuns({
    experiment: experiment.id,
    limit: 10,
  });
  ```
</CodeGroup>

### Tag a winner

Once a prompt variant clears the baseline on the evaluators you care about, tag that prompt version as `production` in Prompts, or use whatever label your application loads in production. For model or pipeline changes, promote the winning value in your app configuration and keep the experiment name or metadata tied to that promoted version.
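What counts as clearing the baseline is a judgment call. As one hedged sketch, the check below compares mean evaluator scores between two runs and requires a minimum gain on every evaluator. `clears_baseline`, the `min_gain` threshold, and the score lists are illustrative assumptions, not part of Arize AX.

```python
from statistics import mean


def clears_baseline(baseline, candidate, min_gain=0.02):
    """True if the candidate's mean score gains at least min_gain on every evaluator."""
    return all(
        mean(candidate[name]) - mean(baseline[name]) >= min_gain
        for name in baseline
    )


# Hypothetical per-run scores, keyed by evaluator name.
baseline_scores = {"correctness": [1, 0, 1, 0], "has_output": [1, 1, 1, 1]}
candidate_scores = {"correctness": [1, 1, 1, 0], "has_output": [1, 1, 1, 1]}

print(clears_baseline(baseline_scores, candidate_scores))
print(clears_baseline(baseline_scores, candidate_scores, min_gain=0.0))
```

With the default threshold the candidate fails because `has_output` is already saturated and cannot improve, so a stricter gate may need per-evaluator thresholds.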

### Classification metrics

If each experiment returns a categorical label instead of free-form text, configure classification metrics from the dataset's **Experiments** tab. The full setup for ground-truth mapping, positive-class selection, and metric definitions lives on [Experiment in Playground](/ax/improve/experiment-in-playground#classification-metrics).

## Additional code workflows

Once the main loop is in place, use these patterns to work faster or handle edge cases. The sections below are mostly about the Python `run()` path.

<span id="authoring-evaluators" />

<h3 id="advanced-evaluator-patterns">
  Evaluator patterns
</h3>

If the minimal evaluator above is enough, stop there. Function evaluators can return an `EvaluationResult`, a numeric score, or a string label; use `EvaluationResult` when you want to include score, label, and explanation together. Class-based evaluators can accept mapped inputs such as `input`, `output`, `dataset_row`, and `metadata`. Use the patterns below when you need more than one evaluator in the same run, shared state, or reusable evaluators.

For deeper evaluator references, see [Run offline evals on experiments](/ax/evaluate/run-evals-on-experiments) and [Create evaluators](/ax/evaluate/create-evaluators).
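As a sketch of the two lighter return types, the function evaluators below return a bare numeric score and a bare string label. The function names and the 200-character budget are illustrative, and the calls at the end only exercise the functions locally with plain values.

```python
def length_score(output, dataset_row) -> float:
    """Numeric score: 1.0 for answers within the length budget, else 0.0."""
    return float(len(output or "") <= 200)


def refusal_label(output, dataset_row) -> str:
    """String label: flags outputs that look like refusals."""
    text = (output or "").lower()
    return "refused" if text.startswith(("i can't", "i cannot")) else "answered"


row = {"attributes.input.value": "Who invented the telephone?"}
print(length_score("Alexander Graham Bell", row))
print(refusal_label("I cannot help with that.", row))
```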

#### Multiple evaluators

Pass a list to `evaluators=` and the client runs each evaluator against each experiment result. Start with multiple function evaluators, or mix function and class-based evaluators in the same list. Each evaluator shows up as its own column in the comparison view.

<Accordion title="Show multiple-evaluator example">
  ```python Python theme={null}
  from arize import ArizeClient
  from arize.experiments import EvaluationResult

  client = ArizeClient(api_key="YOUR_API_KEY")

  def echo_input(dataset_row) -> str:
      return dataset_row.get("attributes.input.value", "")

  def correctness(output, dataset_row) -> EvaluationResult:
      expected = dataset_row.get("attributes.output.value", "")
      generated = output or ""
      correct = bool(expected) and expected.lower() in generated.lower()
      return EvaluationResult(
          score=int(correct),
          label="correct" if correct else "incorrect",
      )

  def has_output(output, dataset_row) -> EvaluationResult:
      present = bool(output)
      return EvaluationResult(
          score=float(present),
          label="present" if present else "missing",
      )

  experiment, experiment_df = client.experiments.run(
      name="multi-eval-experiment",
      dataset="YOUR_DATASET_ID",
      task=echo_input,
      evaluators=[
          correctness,
          has_output,
      ],
  )
  ```
</Accordion>

#### Class-based evaluators

The main API reference focuses on function evaluators. Use a subclass of `Evaluator` when an evaluator holds shared state, runs async, or is reused across projects.

Class-based evaluator methods can request the inputs they need:

| Parameter     | Description                                  | Example                                          |
| ------------- | -------------------------------------------- | ------------------------------------------------ |
| `input`       | Experiment run input                         | `def evaluate(self, input, **kwargs): ...`       |
| `output`      | Experiment run output                        | `def evaluate(self, output, **kwargs): ...`      |
| `dataset_row` | The full dataset row, including every column | `def evaluate(self, dataset_row, **kwargs): ...` |
| `metadata`    | Experiment metadata                          | `def evaluate(self, metadata, **kwargs): ...`    |

<Accordion title="Show code evaluator class example">
  ```python Python theme={null}
  from arize.experiments import EvaluationResult, Evaluator

  class MatchesExpected(Evaluator):
      annotator_kind = "CODE"
      name = "matches_expected"

      def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
          expected_output = dataset_row.get("attributes.output.value")
          matches = expected_output == output
          label = "match" if matches else "mismatch"
          score = float(matches)
          return EvaluationResult(score=score, label=label)

      async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
          return self.evaluate(output=output, dataset_row=dataset_row)
  ```
</Accordion>

<Accordion title="Show LLM evaluator class example">
  This example uses Phoenix Evals and an OpenAI-backed judge. Install `phoenix-evals` and set `OPENAI_API_KEY` before running it.

  ```python Python theme={null}
  import os
  import pandas as pd
  from arize.experiments import EvaluationResult, Evaluator
  from phoenix.evals import (
      HALLUCINATION_PROMPT_RAILS_MAP,
      HALLUCINATION_PROMPT_TEMPLATE,
      OpenAIModel,
      llm_classify,
  )

  class HallucinationEvaluator(Evaluator):
      name = "hallucination"

      def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
          # The hallucination template reads `input`, `reference`, and `output`.
          df_in = pd.DataFrame(
              {
                  "input": [dataset_row.get("attributes.input.value")],
                  "reference": [dataset_row.get("attributes.output.value")],
                  "output": [output],
              }
          )
          result = llm_classify(
              dataframe=df_in,
              template=HALLUCINATION_PROMPT_TEMPLATE,
              model=OpenAIModel(
                  model="gpt-4o-mini",
                  api_key=os.getenv("OPENAI_API_KEY"),
              ),
              # `rails` expects the list of allowed labels, not the map itself.
              rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
              provide_explanation=True,
          )
          label = result["label"][0]
          return EvaluationResult(
              score=1 if label == "factual" else 0,
              label=label,
              explanation=result["explanation"][0],
          )

      async def async_evaluate(self, *, output, dataset_row, **kwargs) -> EvaluationResult:
          return self.evaluate(output=output, dataset_row=dataset_row)
  ```
</Accordion>

### Async experiments

When throughput matters on the `run()` path, declare your task and evaluators with `async def` and raise `concurrency`. In Jupyter, install `nest_asyncio` with `pip install nest_asyncio` (it is not bundled with `arize`) and call `nest_asyncio.apply()` first so the runner can nest its event loop inside the kernel's.

<Frame caption="Synchronous vs asynchronous task and eval execution">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/run-experiments-code-2.png" alt="Diagram showing synchronous sequential execution versus asynchronous parallel execution for experiment tasks and evaluators" />
</Frame>

<Accordion title="Show async experiment example">
  ```python Python theme={null}
  from arize import ArizeClient
  from arize.experiments import EvaluationResult
  import nest_asyncio

  client = ArizeClient(api_key="YOUR_API_KEY")
  nest_asyncio.apply()

  async def async_task(dataset_row):
      return dataset_row.get("attributes.input.value", "")

  async def async_has_output(output, dataset_row):
      present = bool(output)
      return EvaluationResult(
          score=float(present),
          label="present" if present else "missing",
      )

  experiment, experiment_df = client.experiments.run(
      name="async-experiment",
      dataset="YOUR_DATASET_ID_OR_NAME",
      task=async_task,
      evaluators=[async_has_output],
      concurrency=10,
  )
  ```
</Accordion>

<Tip>
  Start with synchronous tasks and evaluators while you're developing the experiment. Sync failures usually break at the line that raised the error, which makes them easier to debug before you switch to async for throughput.
</Tip>
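To see why async tasks benefit from a higher `concurrency`, the standalone sketch below runs simulated I/O-bound tasks under a semaphore. It illustrates the general asyncio pattern only and is not the client's implementation; `run_bounded` and `fake_task` are hypothetical names. Run it as a script, since `asyncio.run()` inside Jupyter needs the `nest_asyncio` setup described above.

```python
import asyncio


async def fake_task(row, state):
    """Simulated I/O-bound task that records how many tasks run at once."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)  # stands in for a network call
    state["in_flight"] -= 1
    return row["attributes.input.value"].upper()


async def run_bounded(rows, concurrency):
    """Run fake_task over rows with at most `concurrency` tasks in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    state = {"in_flight": 0, "peak": 0}

    async def guarded(row):
        async with semaphore:
            return await fake_task(row, state)

    # gather returns outputs in the same order as the input rows.
    outputs = await asyncio.gather(*(guarded(row) for row in rows))
    return outputs, state["peak"]


rows = [{"attributes.input.value": f"q{i}"} for i in range(20)]
outputs, peak = asyncio.run(run_bounded(rows, concurrency=5))
print(len(outputs), peak)
```

While one task waits on I/O, others proceed, so total time falls roughly with the concurrency limit until the downstream service becomes the bottleneck.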

### Dataset sampling

For quick spot checks or balanced subsets, sample the dataset before running. The fastest path is `dry_run=True` with `dry_run_count`, which runs the task against the first N examples without logging:

<Accordion title="Show dry-run sampling example">
  ```python Python theme={null}
  from arize import ArizeClient

  client = ArizeClient(api_key="YOUR_API_KEY")

  def quick_task(dataset_row) -> str:
      return dataset_row.get("attributes.input.value", "")

  experiment, experiment_df = client.experiments.run(
      name="smoke-test",
      dataset="YOUR_DATASET_ID_OR_NAME",
      task=quick_task,
      dry_run=True,
      dry_run_count=10,
  )
  ```
</Accordion>

When you want a specific sample (random, stratified, or systematic), pull the examples, sample in pandas, drop system-managed fields, and create a temporary dataset:

<Accordion title="Show sampled dataset example">
  ```python Python theme={null}
  from arize import ArizeClient

  client = ArizeClient(api_key="YOUR_API_KEY")

  def sampled_task(dataset_row) -> str:
      return dataset_row.get("attributes.input.value", "")

  examples_df = client.datasets.list_examples(
      dataset="YOUR_DATASET_ID_OR_NAME",
      all=True,
  ).to_df()

  sampled_df = examples_df.sample(frac=0.1, random_state=42)
  # Stratify by a label column to preserve class balance:
  # sampled_df = examples_df.groupby("class_label", group_keys=False).apply(
  #     lambda g: g.sample(frac=0.1, random_state=42)
  # )
  # Or select every 10th row:
  # sampled_df = examples_df.iloc[::10, :]

  # Keep only dataset example columns before creating a new dataset.
  example_columns = [
      col for col in sampled_df.columns if col not in {"id", "created_at", "updated_at"}
  ]
  sampled_examples = sampled_df[example_columns].copy()

  sampled_dataset = client.datasets.create(
      name="support-dataset-sample-10pct",
      space="your-space-name-or-id",
      examples=sampled_examples,
  )

  experiment, experiment_df = client.experiments.run(
      name="sampled-experiment",
      dataset=sampled_dataset.id,
      task=sampled_task,
  )
  ```
</Accordion>

You can run experiments on different samples of the same source dataset. Arize still tracks each run against the data it actually saw, so you can compare and visualize those sampled runs cleanly in the product.

### Experiment tracing

When you want your own spans for retrieval, tool calls, or nested model activity, pass `set_global_tracer_provider=True` so the experiment run registers a global tracer provider for that execution. Use it when spans you create manually or through an auto-instrumentor should go through the same tracer provider as the experiment run.

Install the OpenTelemetry and OpenInference packages required by your tracing setup before running these examples. For broader setup guidance, see [Set up tracing](/ax/instrument/set-up-tracing).

**Explicit spans.** Create spans manually for the parts of the task you want visible:

<Accordion title="Show explicit spans example">
  ```python Python theme={null}
  from arize import ArizeClient
  from opentelemetry import trace

  client = ArizeClient(api_key="YOUR_API_KEY")

  def task_add_1(dataset_row):
      tracer = trace.get_tracer(__name__)
      with tracer.start_as_current_span("test_function") as span:
          num = dataset_row.get("attributes.my_number")
          span.set_attribute("dataset.my_number", num)
          return num + 1

  experiment, experiment_df = client.experiments.run(
      name="tracing-demo",
      dataset="YOUR_DATASET_ID_OR_NAME",
      task=task_add_1,
      set_global_tracer_provider=True,
  )
  ```
</Accordion>

**Auto-instrumentor.** For LLM, framework, or vector-store calls, install the matching OpenInference auto-instrumentor so library calls made inside your task can emit spans during the run:

<Accordion title="Show auto-instrumentor example">
  ```python Python theme={null}
  from arize import ArizeClient
  from openai import OpenAI
  from openinference.instrumentation.openai import OpenAIInstrumentor

  client = ArizeClient(api_key="YOUR_API_KEY")
  openai_client = OpenAI()

  OpenAIInstrumentor().instrument()

  def traced_task(dataset_row) -> str:
      question = dataset_row.get("attributes.input.value", "")
      response = openai_client.chat.completions.create(
          model="gpt-4.1",
          messages=[{"role": "user", "content": question}],
      )
      return response.choices[0].message.content or ""

  experiment, experiment_df = client.experiments.run(
      name="auto-instrumented",
      dataset="YOUR_DATASET_ID_OR_NAME",
      task=traced_task,
      set_global_tracer_provider=True,
  )
  ```
</Accordion>

For broader tracing guidance (providers, exporters, and other instrumentors), see [Set up tracing](/ax/instrument/set-up-tracing) and the Python client's [OpenTelemetry tracing](/api-clients/python/version-8/tracing) reference.

<Card title="LangGraph tracing example" icon="code" href="https://colab.research.google.com/drive/1FIJLgqQrd255-cLIKgTeU2ToTQhRQuv9?usp=sharing">
  Runnable Google Colab notebook showing experiment tracing with LangGraph and nested spans.
</Card>

### Handle row-level failures

If you leave `exit_on_error=False`, inspect the returned DataFrame after the run, and confirm which columns it actually contains in your environment before deciding what to retry.

<Info>
  The exact columns in `experiment_df` can vary by run. Check `experiment_df.columns` in your environment before you hard-code a retry filter.
</Info>

A safe retry pattern is:

1. Inspect `experiment_df.columns` and identify how your environment marks failed rows.
2. Filter `experiment_df` down to just the rows you want to retry.
3. Keep only the original dataset columns when you build a temporary retry dataset.
4. Rerun `client.experiments.run()` against that temporary dataset.
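The first three steps can be sketched in pandas. Everything schema-related here is an assumption: the `error` column, the `example_id` column, and the stand-in frames are hypothetical, so substitute whatever `experiment_df.columns` shows in your environment.

```python
import pandas as pd

# Stand-ins for the returned experiment_df and the source dataset examples.
experiment_df = pd.DataFrame(
    {
        "example_id": ["ex_1", "ex_2", "ex_3"],
        "output": ["Bell", None, "Shakespeare"],
        "error": [None, "TimeoutError", None],  # hypothetical failure marker
    }
)
examples_df = pd.DataFrame(
    {
        "id": ["ex_1", "ex_2", "ex_3"],
        "attributes.input.value": ["q1", "q2", "q3"],
    }
)

# 1. Inspect the schema before relying on any column name.
print(list(experiment_df.columns))

# 2. Keep only the rows to retry.
failed_ids = experiment_df.loc[experiment_df["error"].notna(), "example_id"]

# 3. Keep only the original dataset columns for those rows.
retry_examples = examples_df[examples_df["id"].isin(failed_ids)].drop(columns=["id"])
print(retry_examples)
```

For the last step, create a temporary dataset from `retry_examples` and rerun `client.experiments.run()` against it, following the sampled dataset example earlier on this page.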

***

## Next step

<Card title="CI/CD with experiments" icon="arrow-right" href="/ax/improve/ci-cd-for-automated-experiments">
  Automate experiments as regression gates on every PR or deploy. Remote runs are a natural fit for CI, but the same loop also works with `client.experiments.run()`.
</Card>

## Further reading

* [Run offline evals on experiments](/ax/evaluate/run-evals-on-experiments): attach evaluators after the experiment exists, or use the full evaluator inputs and outputs guide.
* [View and manage traces](/ax/observe/tracing/view-and-manage-traces): find the next failure mode to turn into dataset rows.
* [Python experiments API](/api-clients/python/version-8/client-resources/experiments): full parameter list for `run()` and `create()`.
