Automated experiments catch regressions before a prompt, model, or pipeline change reaches production. Wire them into CI, and every push runs the same evaluators against the same dataset as the prior baseline. Arize supports this through the two execution paths from Experiment in code: Log an experiment for tasks CI already runs elsewhere, and Run an experiment when the task fits inside a single Python function.

Define the experiment file

The experiment file is the script CI runs on every trigger. It loads the dataset, defines the task, defines the evaluators, and calls the Python SDK v8 run() path. If your CI job already produces a results file from another runtime, use the remote experiment path from Experiment in code instead.

Dataset

Load the dataset from Arize so every CI invocation tests against the same fixed benchmark. See Build a dataset for how the dataset itself is created and versioned.
Python
from arize import ArizeClient

client = ArizeClient(api_key="YOUR_API_KEY")

# Fetch the dataset record, then pull every example into a DataFrame so each
# CI run scores the same fixed benchmark.
dataset = client.datasets.get(dataset="YOUR_DATASET_ID")
dataset_df = client.datasets.list_examples(
    dataset="YOUR_DATASET_ID",
    all=True,
).to_df()

Task

Define the task function that mirrors the application logic you’re testing. Import from your repo directly so the CI run tracks whatever’s on the current branch:
Python
import json
from openai import OpenAI

from prompt_func.search.search_router import ROUTER_TEMPLATE, avail_tools

TASK_MODEL = "gpt-4.1"
openai_client = OpenAI()

def task(dataset_row) -> str:
    # Rebuild the prompt variables captured on the original trace.
    prompt_vars = json.loads(
        dataset_row["attributes.llm.prompt_template.variables"]
    )
    response = openai_client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
            {"role": "user", "content": json.dumps(prompt_vars)},
        ],
        tools=avail_tools,
    )
    # Join the selected tool names into a single comparable string.
    tool_calls = response.choices[0].message.tool_calls or []
    return ", ".join(tool_call.function.name for tool_call in tool_calls)
See Define a task for the full named-parameter mapping and prompt-versioning guidance.

Evaluator

The evaluator scores each output. Code evaluators and LLM-as-a-judge both work; pick whichever matches the signal you need. Here’s an LLM-as-a-judge that classifies function-selection correctness:
Python
import os
import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from arize.experiments import EvaluationResult

def function_selection(output, dataset_row, **kwargs) -> EvaluationResult:
    # EVALUATOR_TEMPLATE is the judge prompt, assumed defined earlier in the script.
    expected_output = dataset_row["attributes.llm.output_messages"]
    df_in = pd.DataFrame(
        {"selected_output": [output], "expected_output": [expected_output]}
    )
    result = llm_classify(
        dataframe=df_in,
        template=EVALUATOR_TEMPLATE,
        model=OpenAIModel(
            model="gpt-4o-mini",
            api_key=os.environ["OPENAI_API_KEY"],
        ),
        rails=["incorrect", "correct"],
        provide_explanation=True,
    )
    label = result["label"].iloc[0]
    return EvaluationResult(
        score=1 if label == "correct" else 0,
        label=label,
        explanation=result["explanation"].iloc[0],
    )
For experiment evaluator inputs and outputs, see Run offline evals on experiments. For runner-specific advanced patterns such as class-based evaluators and multiple evaluators, see Experiment in code.

Run the experiment

Call run() with a name that bumps on every CI invocation so each run records separately:
Python
experiment, experiment_df = client.experiments.run(
    name="ai-search-v1.1",
    dataset="YOUR_DATASET_ID",
    task=task,
    evaluators=[function_selection],
)

Gate the build on results

Use the returned DataFrame’s mean evaluator score to decide whether the CI job passes or fails.

Determine experiment success

Exit with code 0 when the run clears the threshold, 1 when it regresses:
Python
import sys

EVAL_SCORE_COLUMN = "eval.function_selection.score"
MIN_MEAN_SCORE = 0.7

def determine_experiment_success(experiment_df) -> None:
    mean_score = experiment_df[EVAL_SCORE_COLUMN].mean()
    # Surface the score in the CI log before gating on it.
    print(f"Mean {EVAL_SCORE_COLUMN}: {mean_score:.3f} (threshold {MIN_MEAN_SCORE})")
    success = mean_score > MIN_MEAN_SCORE
    sys.exit(0 if success else 1)

determine_experiment_success(experiment_df)

Auto-increment experiment names

Keep experiment names unique across CI runs by bumping a version suffix:
Python
import re

def increment_experiment_name(experiment_name: str) -> str:
    # Example: "AI Search V1.1" → "AI Search V1.2"
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name
    major, minor = map(int, match.groups())
    return re.sub(r"V\d+\.\d+", f"V{major}.{minor + 1}", experiment_name)
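A hypothetical wiring of the helper into the run call; the previous run's name here is a placeholder that you would in practice read from the GraphQL query below or from a CI artifact:
Python
last_name = "ai-search-V1.1"  # placeholder: look up the latest run's name in practice
experiment, experiment_df = client.experiments.run(
    name=increment_experiment_name(last_name),
    dataset="YOUR_DATASET_ID",
    task=task,
    evaluators=[function_selection],
)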
For programmatic access to experiment history across runs on the same dataset, query the GraphQL API directly. This is useful for diffing the current CI score against a tagged baseline.
Python
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId: ID!) {
          node(id: $DatasetId) {
            ... on Dataset {
              name
              experiments(first: 10) {
                edges {
                  node {
                    name
                    createdAt
                    evaluationScoreMetrics {
                      name
                      meanScore
                    }
                  }
                }
              }
            }
          }
        }
        """
    )
    response = gql_client.execute(
        experiments_query, variable_values={"DatasetId": dataset_id}
    )
    return [
        [edge["node"]["name"], metric["name"], metric["meanScore"]]
        for edge in response["node"]["experiments"]["edges"]
        for metric in edge["node"]["evaluationScoreMetrics"]
    ]
Returns a flat list of [experiment_name, metric_name, mean_score] rows.
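A minimal client-setup sketch to pair with the query above; the endpoint URL and the x-api-key header are assumptions, so confirm both against your Arize developer settings:
Python
import os
from gql import Client
from gql.transport.requests import RequestsHTTPTransport

# Assumed endpoint and auth header; verify in your Arize developer settings.
transport = RequestsHTTPTransport(
    url="https://app.arize.com/graphql",
    headers={"x-api-key": os.environ["ARIZE_API_KEY"]},
)
gql_client = Client(transport=transport, fetch_schema_from_transport=False)

for experiment_name, metric_name, mean_score in fetch_experiment_details(
    gql_client, dataset_id="YOUR_DATASET_ID"
):
    print(experiment_name, metric_name, mean_score)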

Set up the CI workflow

Once the experiment script runs end-to-end locally, wire it into your CI platform. Every platform follows the same pattern: checkout → install dependencies → run the experiment script → exit nonzero on failure.

GitHub Actions

Define the workflow in .github/workflows/*.yml with on: triggers, path filters, and secrets wiring, as in the sketch below.
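A minimal sketch, assuming a requirements.txt at the repo root and an experiments/run_experiment.py entrypoint (both hypothetical paths):
YAML
name: experiment-ci
on:
  push:
    paths:
      - "prompt_func/**"
      - "experiments/**"
jobs:
  run-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # The script exits nonzero on regression, which fails the job.
      - run: python experiments/run_experiment.py
        env:
          ARIZE_API_KEY: ${{ secrets.ARIZE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}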

GitLab CI/CD

Define the job in .gitlab-ci.yml with only: conditions, merge-request triggers, and artifact retention, as in the sketch below.
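A minimal sketch under the same hypothetical file layout; API keys are expected as CI/CD variables, and the artifact path is a placeholder:
YAML
run-experiment:
  image: python:3.11
  only:
    - merge_requests
  script:
    - pip install -r requirements.txt
    # Nonzero exit from the script fails the job.
    - python experiments/run_experiment.py
  artifacts:
    when: always
    paths:
      - experiment_results/
    expire_in: 1 week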

Jenkins

Define a Jenkinsfile with a Docker agent, a Multibranch Pipeline, and PR comment reporting; the sketch below covers the agent and the run stage.
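A minimal declarative-pipeline sketch; the credential IDs are placeholders to create in Jenkins:
Groovy
pipeline {
    agent { docker { image 'python:3.11' } }
    environment {
        // Placeholder credential IDs; add matching secrets in Jenkins.
        ARIZE_API_KEY  = credentials('arize-api-key')
        OPENAI_API_KEY = credentials('openai-api-key')
    }
    stages {
        stage('Run experiment') {
            steps {
                sh 'pip install -r requirements.txt'
                // Nonzero exit from the script fails the build.
                sh 'python experiments/run_experiment.py'
            }
        }
    }
}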

Harness

Define the Harness pipeline YAML with webhook triggers, matrix runs, and notifications; a minimal stage sketch follows.
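A minimal Run-step sketch; the stage scaffolding and identifiers are assumptions to adapt to your Harness project:
YAML
pipeline:
  name: experiment-ci
  identifier: experiment_ci
  stages:
    - stage:
        name: run-experiment
        identifier: run_experiment
        type: CI
        spec:
          cloneCodebase: true
          execution:
            steps:
              - step:
                  type: Run
                  name: run-experiment-script
                  identifier: run_experiment_script
                  spec:
                    shell: Sh
                    command: |
                      pip install -r requirements.txt
                      python experiments/run_experiment.py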
