Define the experiment file
The experiment file is the script CI runs on every trigger. It loads the dataset, defines the task and the evaluators, and calls the Python SDK's run() path. If your CI job already produces a results file from another runtime, use the remote experiment path from Experiment in code instead.
Dataset
Load the dataset from Arize so every CI invocation tests against the same fixed benchmark. See Build a dataset for how the dataset itself is created and versioned.
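A minimal sketch, assuming the ArizeDatasetsClient from the arize SDK; the environment variable names and the "ci-benchmark" dataset name are placeholders, and the client constructor and get_dataset parameters can differ across SDK versions, so check the Python experiments API reference:

```python
import os

from arize.experimental.datasets import ArizeDatasetsClient

# Credentials come from CI secrets, never hardcoded in the repo.
client = ArizeDatasetsClient(developer_key=os.environ["ARIZE_DEVELOPER_KEY"])

# Pull the fixed benchmark every CI run scores against.
# "ci-benchmark" is a placeholder name; see "Build a dataset".
dataset = client.get_dataset(
    space_id=os.environ["ARIZE_SPACE_ID"],
    dataset_name="ci-benchmark",
)
```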
Task
Define the task function that mirrors the application logic you’re testing. Import from your repo directly so the CI run tracks whatever’s on the current branch:
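A sketch of the shape, assuming your application exposes a callable entry point; my_app.agent.answer_question and the "question" column are hypothetical, and the exact task signature run() expects may differ by SDK version:

```python
# Import straight from the repo so CI exercises the code on the current branch.
# `my_app.agent.answer_question` is a placeholder for your own entry point.
from my_app.agent import answer_question


def task(dataset_row: dict) -> str:
    # Each dataset row arrives with the columns of your benchmark dataset;
    # "question" is a placeholder column name.
    return answer_question(dataset_row["question"])
```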
Evaluator
The evaluator scores each output. Code evaluators and LLM-as-a-judge both work; pick whichever matches the signal you need. Here’s an LLM-as-a-judge that classifies function-selection correctness:
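A sketch using OpenAI as the judge; the prompt, model name, score mapping, and the (output, dataset_row) evaluator signature are all assumptions to adapt to your setup:

```python
from openai import OpenAI

judge = OpenAI()  # reads OPENAI_API_KEY from the environment


JUDGE_PROMPT = """You are grading whether an agent selected the correct function.
Question: {question}
Expected function: {expected}
Agent output: {output}
Answer with exactly one word: correct or incorrect."""


def function_selection_correctness(output: str, dataset_row: dict) -> float:
    # Ask the judge model to classify the output, then map the label to a score.
    response = judge.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=dataset_row["question"],
                expected=dataset_row["expected_function"],
                output=output,
            ),
        }],
    )
    label = response.choices[0].message.content.strip().lower()
    return 1.0 if label == "correct" else 0.0
```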
Run the experiment
Call run() with a name that bumps on every CI invocation so each run records separately:
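A sketch continuing the snippets above, under the assumption that run() maps to the datasets client's run_experiment and returns the experiment ID alongside a results DataFrame; parameter names may differ by SDK version, so check the Python experiments API reference:

```python
import os

# A CI-provided run number keeps the name unique per invocation; see
# "Auto-increment experiment names" below for a fallback strategy.
experiment_name = f"ci-fn-selection-v{os.environ.get('GITHUB_RUN_NUMBER', '0')}"

experiment_id, results_df = client.run_experiment(
    space_id=os.environ["ARIZE_SPACE_ID"],
    dataset_name="ci-benchmark",
    task=task,
    evaluators=[function_selection_correctness],
    experiment_name=experiment_name,
)
```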
Gate the build on results
Use the returned DataFrame’s mean evaluator score to decide whether the CI job passes or fails.
Determine experiment success
Exit with code 0 when the run clears the threshold, 1 when it regresses:
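A minimal sketch; the 0.85 threshold is illustrative, and the score column name is an assumption, so inspect results_df.columns for the column your evaluator actually produced:

```python
import sys

THRESHOLD = 0.85  # illustrative pass bar; tune to your benchmark

# Column naming is an assumption; print results_df.columns to confirm it.
mean_score = results_df["eval.function_selection_correctness.score"].mean()

if mean_score >= THRESHOLD:
    print(f"PASS: mean score {mean_score:.3f} >= {THRESHOLD}")
    sys.exit(0)

print(f"FAIL: mean score {mean_score:.3f} < {THRESHOLD}")
sys.exit(1)
```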
Auto-increment experiment names
Keep experiment names unique across CI runs by bumping a version suffix:
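One sketch, assuming a GitHub Actions environment; GITHUB_RUN_NUMBER is platform-specific, so substitute your CI's equivalent (CI_PIPELINE_IID on GitLab, BUILD_NUMBER on Jenkins):

```python
import os
import time


def next_experiment_name(base: str = "ci-fn-selection") -> str:
    # Prefer the CI run number for a monotonically increasing suffix;
    # fall back to a timestamp for local runs.
    suffix = os.environ.get("GITHUB_RUN_NUMBER") or str(int(time.time()))
    return f"{base}-v{suffix}"


experiment_name = next_experiment_name()
```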
Fetch experiment history via GraphQL (advanced)
For programmatic access to experiment history across runs on the same dataset, query the GraphQL API directly. Useful for diffing the current CI score against a tagged baseline.
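A sketch, assuming the app.arize.com/graphql endpoint and an experiments connection on the Dataset node; the query fields shown here are assumptions, so confirm the schema in the GraphQL explorer before relying on this shape:

```python
import os

import requests

QUERY = """
query ExperimentHistory($datasetId: ID!) {
  node(id: $datasetId) {
    ... on Dataset {
      experiments(first: 50) {
        edges {
          node {
            name
            scores { name meanScore }
          }
        }
      }
    }
  }
}
"""


def fetch_experiment_history(dataset_id: str) -> list[list]:
    # Authenticate with an API key from CI secrets; header name is an assumption.
    resp = requests.post(
        "https://app.arize.com/graphql",
        headers={"x-api-key": os.environ["ARIZE_API_KEY"]},
        json={"query": QUERY, "variables": {"datasetId": dataset_id}},
        timeout=30,
    )
    resp.raise_for_status()
    edges = resp.json()["data"]["node"]["experiments"]["edges"]
    # Flatten one row per (experiment, metric) pair.
    return [
        [edge["node"]["name"], score["name"], score["meanScore"]]
        for edge in edges
        for score in edge["node"]["scores"]
    ]
```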
Returns a flat list of [experiment_name, metric_name, mean_score] rows.
Set up the CI workflow
Once the experiment script runs end-to-end locally, wire it into your CI platform. Every platform follows the same pattern: checkout → install dependencies → run the experiment script → exit nonzero on failure.

- GitHub Actions: .github/workflows/*.yml setup with on: triggers, path filters, and secrets wiring.
- GitLab CI/CD: .gitlab-ci.yml with only: conditions, merge-request triggers, and artifact retention.
- Jenkins: Jenkinsfile with Docker agent, Multibranch Pipeline, and PR comment reporting.
- Harness: Harness pipeline YAML with webhook triggers, matrix runs, and notifications.
Further reading
- Experiment in code: the execution paths (Log and Run) that CI jobs invoke.
- Python experiments API: full reference for run(), create(), and list_runs().
- Build a dataset: the benchmark that every CI run scores against.