# CI/CD with experiments

> Automate experiments as regression gates on every PR or deploy so prompt, model, and pipeline changes are validated before they ship

Automated experiments catch regressions before a prompt, model, or pipeline change reaches production. Wire them into CI and every push runs the same evaluators against the same dataset as the prior baseline. Arize AX supports this through the two execution paths described in [Experiment in code](/ax/improve/experiment-in-code): [Log an experiment](/ax/improve/experiment-in-code#log-an-experiment) for tasks that CI already runs elsewhere, and [Run an experiment](/ax/improve/experiment-in-code#run-an-experiment) when the task fits inside a single Python function.

## Define the experiment file

The experiment file is the script CI runs on every trigger. It loads the dataset, defines the task, defines the evaluators, and calls `run()` from the Python SDK v8. If your CI job already produces a results file from another runtime, use the [Log an experiment](/ax/improve/experiment-in-code#log-an-experiment) path instead.

### Dataset

Load the dataset from Arize AX so every CI invocation tests against the same fixed benchmark. See [Build a dataset](/ax/improve/build-a-dataset) for how the dataset itself is created and versioned.

```python Python theme={null}
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

dataset = client.datasets.get(dataset="YOUR_DATASET_ID")
dataset_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID",
    all=True,
).to_df()
```

### Task

Define the task function that mirrors the application logic you're testing. Import from your repo directly so the CI run tracks whatever's on the current branch:

```python Python theme={null}
import json
from openai import OpenAI

from prompt_func.search.search_router import ROUTER_TEMPLATE, avail_tools

TASK_MODEL = "gpt-4.1"
openai_client = OpenAI()

def task(dataset_row) -> str:
    prompt_vars = json.loads(
        dataset_row["attributes.llm.prompt_template.variables"]
    )
    response = openai_client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
            {"role": "user", "content": json.dumps(prompt_vars)},
        ],
        tools=avail_tools,
    )
    tool_calls = response.choices[0].message.tool_calls or []
    return ", ".join(tool_call.function.name for tool_call in tool_calls)
```

See [Define a task](/ax/improve/experiment-in-code#define-a-task) for the full named-parameter mapping and prompt-versioning guidance.

### Evaluator

The evaluator scores each output. Code evaluators and LLM-as-a-judge both work; pick whichever matches the signal you need. Here's an LLM-as-a-judge that classifies function-selection correctness:

```python Python theme={null}
import os
import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from arize.experiments import EvaluationResult

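# EVALUATOR_TEMPLATE is your judge prompt, defined elsewhere in this file. It must
# reference the {selected_output} and {expected_output} columns built below.
def function_selection(output, dataset_row, **kwargs) -> EvaluationResult: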
    expected_output = dataset_row["attributes.llm.output_messages"]
    df_in = pd.DataFrame(
        {"selected_output": [output], "expected_output": [expected_output]}
    )
    result = llm_classify(
        dataframe=df_in,
        template=EVALUATOR_TEMPLATE,
        model=OpenAIModel(
            model="gpt-4o-mini",
            api_key=os.environ["OPENAI_API_KEY"],
        ),
        rails=["incorrect", "correct"],
        provide_explanation=True,
    )
    label = result["label"][0]
    return EvaluationResult(
        score=1 if label == "correct" else 0,
        label=label,
        explanation=result["explanation"][0],
    )
```

For experiment evaluator inputs and outputs, see [Run offline evals on experiments](/ax/evaluate/run-evals-on-experiments). For runner-specific advanced patterns such as class-based evaluators and multiple evaluators, see [Experiment in code](/ax/improve/experiment-in-code#advanced-evaluator-patterns).

### Run the experiment

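Call `run()` with a name that is unique to each CI invocation so every run is recorded separately. The literal name below is a placeholder; the Auto-increment experiment names section further down shows one way to generate it: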

```python Python theme={null}
experiment, experiment_df = client.experiments.run(
    name="ai-search-v1.1",
    dataset="YOUR_DATASET_ID",
    task=task,
    evaluators=[function_selection],
)
```

## Gate the build on results

Use the returned DataFrame's mean evaluator score to decide whether the CI job passes or fails.

### Determine experiment success

Exit with code 0 when the run clears the threshold, 1 when it regresses:

```python Python theme={null}
import sys

EVAL_SCORE_COLUMN = "eval.function_selection.score"
MIN_MEAN_SCORE = 0.7

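def determine_experiment_success(experiment_df) -> None:
    mean_score = experiment_df[EVAL_SCORE_COLUMN].mean()
    # Print the score so it appears in the CI log next to the pass/fail result
    print(f"Mean {EVAL_SCORE_COLUMN}: {mean_score:.3f} (minimum {MIN_MEAN_SCORE})")
    # A score equal to the minimum passes; a NaN mean (no scores at all) fails
    success = mean_score >= MIN_MEAN_SCORE
    sys.exit(0 if success else 1)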

determine_experiment_success(experiment_df)
```
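
One caveat with a plain `.mean()`: pandas skips `NaN` values by default, so rows whose evaluator produced no score drop out of the average without any warning. The sketch below is one way to guard against that. It assumes missing scores show up as `None` or `NaN`; `gate_on_scores` and `max_missing_fraction` are hypothetical names, not part of the Arize SDK.

```python
import math
from typing import Iterable, Optional


def gate_on_scores(
    scores: Iterable[Optional[float]],
    min_mean_score: float = 0.7,
    max_missing_fraction: float = 0.05,
) -> bool:
    """Return True when the scores clear the gate (hypothetical helper)."""
    values = list(scores)
    if not values:
        return False  # an empty run should never pass the gate

    # Keep only rows that actually received a numeric score
    present = [s for s in values if s is not None and not math.isnan(s)]
    if not present:
        return False  # every row is missing a score

    # Fail when too many rows are unscored, even if the rest look good
    missing_fraction = 1 - len(present) / len(values)
    if missing_fraction > max_missing_fraction:
        return False

    # Inclusive comparison: a mean equal to the minimum passes
    return sum(present) / len(present) >= min_mean_score
```

To use it in place of the bare mean, pass the score column as a list, for example `gate_on_scores(experiment_df[EVAL_SCORE_COLUMN].tolist())`, and exit with 0 or 1 based on the result.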

### Auto-increment experiment names

Keep experiment names unique across CI runs by bumping a version suffix:

```python Python theme={null}
import re

def increment_experiment_name(experiment_name: str) -> str:
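    # Example: "AI Search V1.1" → "AI Search V1.2"; "ai-search-v1.1" → "ai-search-v1.2"
    match = re.search(r"([vV])(\d+)\.(\d+)", experiment_name)
    if not match:
        # No version suffix to bump; the caller gets the name back unchanged
        return experiment_name
    prefix, major, minor = match.group(1), match.group(2), int(match.group(3))
    bumped = f"{prefix}{major}.{minor + 1}"
    # Rebuild around the first match only, leaving the rest of the name untouched
    return experiment_name[: match.start()] + bumped + experiment_name[match.end():]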
```
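
An alternative to bumping a version suffix is to derive the name from metadata the CI platform already exposes. The sketch below assumes GitHub Actions, which sets `GITHUB_SHA` and `GITHUB_RUN_NUMBER` for every workflow run. `ci_experiment_name` is a hypothetical helper, and other platforms use different variable names.

```python
import os
from typing import Mapping, Optional


def ci_experiment_name(
    base_name: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Build an experiment name from CI metadata (hypothetical helper)."""
    env = os.environ if env is None else env
    run_number = env.get("GITHUB_RUN_NUMBER", "")
    short_sha = env.get("GITHUB_SHA", "")[:7]  # abbreviated commit hash

    # Append only the parts that are set, so local runs still get a usable name
    parts = [base_name]
    if run_number:
        parts.append(f"run{run_number}")
    if short_sha:
        parts.append(short_sha)
    return "-".join(parts)
```

Re-running a job in GitHub Actions keeps the same run number, so include `GITHUB_RUN_ATTEMPT` as well if re-runs need to be recorded under separate names.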

<Accordion title="Fetch experiment history via GraphQL (advanced)">
  For programmatic access to experiment history across runs on the same dataset, query the GraphQL API directly. This is useful for diffing the current CI score against a tagged baseline.

  ```python Python theme={null}
  from gql import Client, gql
  from gql.transport.requests import RequestsHTTPTransport

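  def fetch_experiment_details(gql_client, dataset_id):
      # gql_client: a gql Client that the caller builds on a RequestsHTTPTransport
      # pointed at the Arize GraphQL endpoint and authenticated with your API key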
      experiments_query = gql(
          """
          query getExperimentDetails($DatasetId: ID!) {
            node(id: $DatasetId) {
              ... on Dataset {
                name
                experiments(first: 10) {
                  edges {
                    node {
                      name
                      createdAt
                      evaluationScoreMetrics {
                        name
                        meanScore
                      }
                    }
                  }
                }
              }
            }
          }
          """
      )
      response = gql_client.execute(experiments_query, {"DatasetId": dataset_id})
      return [
          [edge["node"]["name"], metric["name"], metric["meanScore"]]
          for edge in response["node"]["experiments"]["edges"]
          for metric in edge["node"]["evaluationScoreMetrics"]
      ]
  ```

  Returns a flat list of `[experiment_name, metric_name, mean_score]` rows.
</Accordion>
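
To turn the rows from `fetch_experiment_details` into a baseline comparison, a minimal sketch follows. It assumes the `[experiment_name, metric_name, mean_score]` row shape described in the accordion above; `regressed_against_baseline`, the baseline name, and the `max_drop` tolerance are all hypothetical.

```python
from typing import Optional, Sequence


def regressed_against_baseline(
    rows: Sequence[Sequence],
    baseline_name: str,
    metric_name: str,
    current_score: float,
    max_drop: float = 0.02,
) -> Optional[bool]:
    """Compare the current mean score with a named baseline (hypothetical helper).

    Returns True on a regression, False otherwise, and None when the
    baseline has no usable score for the metric.
    """
    for experiment_name, metric, mean_score in rows:
        if experiment_name != baseline_name or metric != metric_name:
            continue
        if mean_score is None:
            return None  # baseline exists but was never scored on this metric
        # Tolerate small run-to-run noise; flag only drops larger than max_drop
        return current_score < mean_score - max_drop
    return None
```

Returning `None` for a missing baseline lets the CI script decide separately whether that case should fail the build or fall back to the fixed threshold.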

## Set up the CI workflow

Once the experiment script runs end-to-end locally, wire it into your CI platform. Every platform follows the same pattern: checkout → install dependencies → run the experiment script → exit nonzero on failure.
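
The last step of that pattern, exiting nonzero on failure, can be sketched as a small entry point. An uncaught Python exception already exits with code 1, which makes a crashed run look the same as a regression; the sketch below separates the two. `run_gate` and the exit-code constants are hypothetical, and `run_experiment` stands in for whatever callable wraps your `run()` call and returns the mean score.

```python
import traceback
from typing import Callable

# Hypothetical exit codes: CI treats any nonzero value as a failed job
EXIT_PASS = 0
EXIT_REGRESSION = 1
EXIT_ERROR = 2


def run_gate(run_experiment: Callable[[], float], min_mean_score: float) -> int:
    """Run the experiment and map the outcome to an exit code."""
    try:
        mean_score = run_experiment()
    except Exception:
        # Infrastructure problems (network, credentials) land here, not in the gate
        traceback.print_exc()
        return EXIT_ERROR

    print(f"mean score: {mean_score:.3f} (minimum {min_mean_score})")
    return EXIT_PASS if mean_score >= min_mean_score else EXIT_REGRESSION
```

The script would then end with `sys.exit(run_gate(my_experiment, 0.7))`, where `my_experiment` is your own wrapper. Both nonzero codes fail the job, but the CI log shows which kind of failure occurred.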

<CardGroup cols={2}>
  <Card title="Azure DevOps" icon="microsoft" href="/ax/improve/azure-devops-ci-cd">
    `azure-pipelines.yml` with variable groups, PR triggers, and environments for promotion gates.
  </Card>

  <Card title="GitHub Actions" icon="github" href="/ax/improve/github-action-basics">
    `.github/workflows/*.yml` setup with `on:` triggers, path filters, and secrets wiring.
  </Card>

  <Card title="GitLab CI/CD" icon="gitlab" href="/ax/improve/gitlab-ci-cd-basics">
    `.gitlab-ci.yml` with `only:` conditions, merge-request triggers, and artifact retention.
  </Card>

  <Card title="Jenkins" icon="jenkins" href="/ax/improve/jenkins-integration">
    `Jenkinsfile` with Docker agent, Multibranch Pipeline, and PR comment reporting.
  </Card>

  <Card title="Harness" icon="gears" href="/ax/improve/harness-ci-cd">
    Harness pipeline YAML with webhook triggers, matrix runs, and notifications.
  </Card>
</CardGroup>

## Further reading

* [Experiment in code](/ax/improve/experiment-in-code): the execution paths (Log and Run) that CI jobs invoke.
* [Python experiments API](/api-clients/python/version-8/client-resources/experiments): full reference for `run()`, `create()`, and `list_runs()`.
* [Build a dataset](/ax/improve/build-a-dataset): the benchmark that every CI run scores against.
