What this enables: Run Arize AX experiment evaluations automatically as part of your Jenkins pipelines — on every PR, on a schedule, or on-demand. Catch regressions in accuracy, latency, and cost before they hit production.

Key Concepts

  • Pipeline: An automated workflow defined in a Jenkinsfile (Groovy-based, not YAML).
  • Stages: Named groups of work that run sequentially (e.g., Setup, Test, Report).
  • Steps: Individual commands within a stage.
  • Agent: Where the pipeline runs — a Jenkins node, a Docker container, or a Kubernetes pod.
  • Triggers: How pipelines get kicked off — webhooks, cron schedules, or upstream jobs.

Prerequisites & Assumptions

This guide assumes:
  • Jenkins is running on a recent LTS release with Java 17+. See Java support policy for details.
  • A Jenkins agent capable of running Docker containers (needed for the Python image approach below), or Python 3.12+ installed directly on the agent.
  • Your Jenkins instance can reach your Git provider (GitHub, GitLab, Bitbucket) via webhook or polling.
  • Required plugins are installed — at minimum Pipeline, Docker Pipeline (for the containerized agent below), Git, and Credentials Binding.

🔑 Secrets setup: Before your pipeline can run, store your API keys in Jenkins → Manage Jenkins → Credentials. Add OPENAI_API_KEY, ARIZE_API_KEY, SPACE_ID, and DATASET_ID as “Secret text” credentials. The Jenkinsfile below references these by their credential IDs.

Setting Up Your First Experiment Pipeline

Create a Jenkinsfile

Place a Jenkinsfile in the root of your repository. Jenkins uses a Groovy-based DSL (not YAML).
pipeline {
    agent {
        docker {
            image 'python:3.12'
        }
    }

    environment {
        OPENAI_API_KEY  = credentials('OPENAI_API_KEY')
        ARIZE_API_KEY   = credentials('ARIZE_API_KEY')
        SPACE_ID        = credentials('SPACE_ID')
        DATASET_ID      = credentials('DATASET_ID')
    }

    stages {
        stage('Install Dependencies') {
            steps {
                sh 'pip install -q arize arize-phoenix nest_asyncio packaging openai "gql[all]"'
            }
        }

        stage('Run Experiment') {
            steps {
                sh 'python ./copilot/experiments/ai_search_test.py'
            }
        }
    }

    post {
        always {
            archiveArtifacts artifacts: 'experiment_results.json', allowEmptyArchive: true
        }
    }
}

Breakdown

  • pipeline { } — Top-level block; everything lives inside this.
  • agent { docker { image '...' } } — Runs the entire pipeline inside a Docker container. Jenkins pulls the image for you.
  • environment { } — Injects secrets from the Jenkins credential store. The credentials() helper masks values in logs automatically.
  • stages / stage — Sequential groups of work. Each stage appears as a separate column in the Pipeline Stage View.
  • steps — Commands to execute. sh runs shell commands.
  • post { always { } } — Runs after all stages complete (pass or fail). archiveArtifacts saves files to the Jenkins build page for download.
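The archiveArtifacts step only works if the experiment script actually writes experiment_results.json. A minimal sketch of that contract — the function names and metric fields here are placeholders, not the real ai_search_test.py:

```python
# Hypothetical shape of ./copilot/experiments/ai_search_test.py:
# run the evaluation, then write experiment_results.json so the
# post { always { archiveArtifacts ... } } block has something to archive.
import json
import os


def run_experiment(dataset_id):
    # Placeholder for the real Arize experiment call; returns summary metrics.
    return {"dataset_id": dataset_id, "accuracy": 0.91, "latency_p50": 412}


def main():
    # DATASET_ID is injected by the pipeline's environment { } block.
    dataset_id = os.environ.get("DATASET_ID", "default-dataset")
    results = run_experiment(dataset_id)
    with open("experiment_results.json", "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()
```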
No Docker? Replace the agent block with agent any and make sure Python 3.12+ is on your Jenkins node. You may also want to add a sh 'python3 --version' step to verify.

Trigger Options

Unlike repo-hosted CI systems where triggers are defined entirely in the pipeline file, Jenkins separates what runs (Jenkinsfile) from when it runs (job configuration). Triggers can be set in the Jenkinsfile itself using the triggers directive, configured in the Jenkins UI, or driven by webhooks from your Git provider.

1. Webhook (Push / Pull Request)

The most common setup: your Git provider sends a webhook to Jenkins when code changes.

Setup: Configure a webhook in your Git provider pointing to https://<your-jenkins>/github-webhook/ (for GitHub) or the equivalent endpoint. Then in Jenkins, create a Multibranch Pipeline job pointing to your repo.
// No triggers block needed — Multibranch Pipeline jobs
// automatically build on push when webhooks are configured.
pipeline {
    agent any
    stages {
        stage('Test') {
            steps {
                sh 'echo "Triggered by push or PR"'
            }
        }
    }
}
Multibranch Pipeline is the recommended job type for most teams. It automatically discovers branches and PRs in your repo and runs the Jenkinsfile found in each. No manual job creation per branch.

2. Scheduled (Cron)

pipeline {
    agent any
    triggers {
        // Jenkins cron: MINUTE HOUR DOM MONTH DOW
        cron('0 0 * * *')  // Every day at midnight
    }
    stages {
        stage('Nightly Eval') {
            steps {
                sh 'python ./copilot/experiments/ai_search_test.py'
            }
        }
    }
}
Cron syntax note: Jenkins cron uses H (hash) for load distribution. H 0 * * * means “sometime in the midnight hour” — Jenkins picks a stable minute per job to avoid all jobs firing at :00. Use exact times only when it actually matters.

3. Polling SCM (Fallback When Webhooks Aren’t Possible)

Jenkins periodically checks your repo for changes. Use this when your Jenkins instance isn’t reachable from your Git provider (e.g., behind a firewall).
pipeline {
    agent any
    triggers {
        pollSCM('H/5 * * * *')  // Check every 5 minutes
    }
    stages {
        stage('Test') {
            steps {
                sh 'echo "Detected new changes"'
            }
        }
    }
}

4. Upstream Job (Pipeline Chaining)

Trigger one pipeline after another completes — useful for running evals only after a build passes.
pipeline {
    agent any
    triggers {
        upstream(upstreamProjects: 'my-build-job', threshold: hudson.model.Result.SUCCESS)
    }
    stages {
        stage('Post-Build Eval') {
            steps {
                sh 'python ./copilot/experiments/ai_search_test.py'
            }
        }
    }
}

5. Manual Only (No Automatic Trigger)

Omit the triggers block entirely. The pipeline runs only when someone clicks Build Now in the Jenkins UI or calls the Jenkins API.
pipeline {
    agent any
    // No triggers block — manual execution only
    stages {
        stage('On-Demand Eval') {
            steps {
                sh 'python ./copilot/experiments/ai_search_test.py'
            }
        }
    }
}
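The "calls the Jenkins API" path works by POSTing to the job's /build endpoint, authenticated with a user API token. A sketch of assembling that request — the URL, job name, and credentials are placeholders for your instance:

```python
# Build the URL and auth header for queuing a Jenkins build remotely.
import base64
from urllib.parse import quote


def build_trigger_request(jenkins_url, job_name, user, token):
    """Return (url, headers) for a POST that queues a build of job_name."""
    url = f"{jenkins_url.rstrip('/')}/job/{quote(job_name)}/build"
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    return url, {"Authorization": f"Basic {auth}"}


# POST the returned URL with those headers (curl, urllib.request, etc.);
# Jenkins responds 201 Created with a Location header for the queue item.
```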

6. Parameterized Builds

Allow users to pass inputs when triggering a build — useful for running experiments against different datasets or models.
pipeline {
    agent any
    parameters {
        string(name: 'DATASET_ID', defaultValue: 'default-dataset', description: 'Arize dataset to evaluate against')
        choice(name: 'MODEL', choices: ['gpt-4o', 'gpt-4o-mini', 'claude-sonnet-4-5-20250929'], description: 'Model to test')
    }
    stages {
        stage('Run Experiment') {
            steps {
                // Parameters are exposed as environment variables; single quotes
                // let the shell expand them (avoids Groovy-interpolating user input)
                sh 'python ./copilot/experiments/ai_search_test.py --dataset "$DATASET_ID" --model "$MODEL"'
            }
        }
    }
}
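On the script side, those flags imply argument parsing along these lines. This is a sketch — the real ai_search_test.py interface is an assumption:

```python
# Hypothetical flag handling in ai_search_test.py matching the
# --dataset / --model arguments the parameterized pipeline passes.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run an Arize AX experiment")
    parser.add_argument("--dataset", default="default-dataset",
                        help="Arize dataset ID to evaluate against")
    parser.add_argument("--model", default="gpt-4o-mini",
                        help="Model under test")
    return parser.parse_args(argv)
```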

Scoping Pipelines to Specific File Changes

If you only want experiments to run when relevant code changes (prompt templates, retrieval logic, eval scripts), you can scope stages using a changeset condition. With Multibranch Pipeline jobs, every push triggers a build — this lets you skip the experiment stage when irrelevant files change:
pipeline {
    agent any
    stages {
        stage('Run Experiment') {
            when {
                changeset 'copilot/search/**'
            }
            steps {
                sh 'python ./copilot/experiments/ai_search_test.py'
            }
        }
    }
}
This still triggers the pipeline, but the stage is skipped if no files in copilot/search/ changed. The build will show as successful (just with a skipped stage).
⚠️ Important distinction: This is stage-level filtering, not pipeline-level. The pipeline still starts, checks out code, and evaluates the condition. For high-frequency repos, this can mean a lot of no-op builds. If that’s a concern, look into the Generic Webhook Trigger plugin, which can inspect the webhook payload before starting a build.

More Mature Patterns

Once you have the basics working, here are patterns that become relevant as your experiment workflows grow.

Parallel Evaluation Runs

Run experiments against multiple models or datasets simultaneously:
stage('Evaluate Models') {
    parallel {
        stage('GPT-4o') {
            steps {
                sh 'python ./experiments/eval.py --model gpt-4o'
            }
        }
        stage('Claude Sonnet') {
            steps {
                sh 'python ./experiments/eval.py --model claude-sonnet-4-5-20250929'
            }
        }
    }
}

Shared Libraries

If multiple repos need the same experiment setup (install deps, configure credentials, post results to Arize), extract it into a Jenkins Shared Library:
// In your Jenkinsfile — after setting up the shared library in Jenkins config
@Library('arize-experiment-lib') _

pipeline {
    agent { docker { image 'python:3.12' } }
    stages {
        stage('Run') {
            steps {
                arizeExperiment(script: './copilot/experiments/ai_search_test.py')
            }
        }
    }
}

Post Results as PR Comments

Use pipeline steps to post experiment results directly on the PR, so reviewers can see the impact of code changes without leaving their Git provider:
post {
    success {
        script {
            // readJSON comes from the Pipeline Utility Steps plugin
            def results = readJSON file: 'experiment_results.json'
            // pullRequest is provided by a Git provider plugin (e.g. the
            // pipeline-github plugin for GitHub) and requires credentials
            pullRequest.comment("## Experiment Results\n- Accuracy: ${results.accuracy}\n- Latency p50: ${results.latency_p50}ms")
        }
    }
}
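If you'd rather keep the markdown assembly out of Groovy, the comment body can be built by a small helper in the experiment script itself. The field names here mirror the example above and are assumptions about the results file:

```python
# Hypothetical formatter turning experiment_results.json fields into the
# markdown body posted on the PR.
def format_pr_comment(results):
    return (
        "## Experiment Results\n"
        f"- Accuracy: {results['accuracy']}\n"
        f"- Latency p50: {results['latency_p50']}ms"
    )
```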

Notifications

// slackSend requires the Slack Notification plugin
post {
    failure {
        slackSend channel: '#ml-experiments', message: "❌ Experiment failed: ${env.BUILD_URL}"
    }
    success {
        slackSend channel: '#ml-experiments', message: "✅ Experiment passed: ${env.BUILD_URL}"
    }
}