If a prompt change can break things in production, a prompt change should pass the same kind of check that any other change passes. That’s what CI/CD for prompts is — a code-style PR check that runs an experiment, applies a success predicate, and blocks the merge if the predicate fails. This page covers the conceptual shape of CI/CD for prompts. The how-to-wire-it-up for specific CI systems lives in the Experiments + CI/CD docs.

The shape

Figure: The CI/CD prompt flow. A pull request triggers a CI job; the job runs an experiment script that calls Arize AX; the experiment-success predicate evaluates the results; and a pass or fail signal is returned to the PR check, which blocks or allows the merge accordingly.

Three moving parts:

An experiment script — Python, TypeScript, or ax CLI — that defines the dataset, the task (running the prompt), and the evaluator(s). Same shape as a code-driven experiment you’d run by hand.
An experiment-success predicate — a boolean expression over evaluator scores. Examples: exact_match > 0.7, hallucination_rate < 0.2, accuracy >= baseline.accuracy. The script exits non-zero when the predicate fails.
A CI workflow file — GitHub Actions / GitLab CI / Jenkins / Harness — that runs the script on every PR and reports pass/fail back to the merge check.

The PR doesn’t merge unless the predicate passes. A prompt edit that hurts eval scores fails CI the same way a code change that breaks tests fails CI.
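The first two parts can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Arize AX SDK: the dataset is inlined, `run_prompt` is a hypothetical stand-in for the call that runs the prompt against a model, and every function name here is made up for the example.

```python
from typing import Callable

# Hypothetical fixed dataset: each row has an input and a reference answer.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def run_prompt(text: str) -> str:
    """Stand-in for the task. A real script would call the model here."""
    canned = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return canned[text]


def exact_match(output: str, expected: str) -> float:
    """Evaluator: 1.0 when the output equals the reference, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(task: Callable[[str], str]) -> float:
    """Run the task over every row and return the mean evaluator score."""
    scores = [exact_match(task(row["input"]), row["expected"]) for row in DATASET]
    return sum(scores) / len(scores)


def main() -> int:
    """Return a process exit code: 0 if the predicate holds, 1 otherwise."""
    score = run_experiment(run_prompt)
    passed = score > 0.7  # the experiment-success predicate
    print(f"exact_match={score:.2f} passed={passed}")
    return 0 if passed else 1
```

In this sketch the stand-in task gets one of three rows wrong, so the score is about 0.67 and `main()` returns 1. A script run in CI would typically end with `raise SystemExit(main())` so that the non-zero value fails the check.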

Why this matters

Without CI on prompts, two failure classes are silent:

Prompt regression. Someone tightens the system message and accidentally makes the output less concise. Eval scores drop. Nobody notices until users complain.
Model drift. A provider releases a new model version; the same prompt now behaves differently. Without a CI check, the drift only surfaces when someone happens to look.

The eval-against-a-fixed-dataset pattern catches both. The dataset is the constant; if the prompt or the model changes, the score delta against the same dataset is what tells you whether the change was a regression.
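As a rough sketch of that score delta, assuming each run is represented as a mapping from a row ID to an evaluator score (a hypothetical representation chosen for this example, not an Arize AX data structure):

```python
from statistics import mean
from typing import Dict


def score_delta(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Mean candidate score minus mean baseline score over the same rows."""
    # The comparison only means something if the dataset is held constant.
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must score exactly the same dataset rows")
    return mean(list(candidate.values())) - mean(list(baseline.values()))


# Hypothetical per-row scores for the same four rows, before and after a change
# to the prompt or the model.
baseline_run = {"row-1": 1.0, "row-2": 1.0, "row-3": 0.0, "row-4": 1.0}
candidate_run = {"row-1": 1.0, "row-2": 0.0, "row-3": 0.0, "row-4": 1.0}

delta = score_delta(baseline_run, candidate_run)
print(f"score delta: {delta:+.2f}")  # prints "score delta: -0.25"
```

A negative delta on an unchanged dataset points at the change itself, whether that change was a prompt edit or a new model version.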

The experiment-success predicate

The predicate is the smallest piece of code in the system, and the most consequential. It encodes the bar. Some patterns:

Predicate	What it asserts
`exact_match > 0.7`	More than 70% of rows must produce an exact match against the reference column.
`hallucination_rate < 0.2`	Fewer than 20% of outputs can be labeled as hallucinations by the judge.
`mean_score >= 0.85`	The average evaluator score across the dataset must be at least 0.85.
`accuracy >= prior.accuracy - 0.02`	The change can’t drop accuracy by more than 2 percentage points (0.02) vs the prior baseline.
`no_regression(prior_run)`	No specific row that the prior version got right can be wrong in the new version.

The right predicate depends on what your application cares about. “Don’t regress on any previously passing row” is strict; “the average score has to stay at or above 0.85” is permissive. Both have their place.
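To make the strict-versus-permissive contrast concrete, here is a sketch of three of these predicates as plain Python functions. The function names, signatures, and the row-ID-to-pass/fail representation are assumptions made for illustration; in particular, `no_regression` here takes both runs explicitly rather than mirroring any real API.

```python
from statistics import mean
from typing import Dict, List


def mean_at_least(scores: List[float], threshold: float) -> bool:
    """Permissive: only the dataset-wide average has to clear the bar."""
    return mean(scores) >= threshold


def within_tolerance(new_accuracy: float, prior_accuracy: float,
                     tolerance: float = 0.02) -> bool:
    """Relative: accuracy may drop, but by no more than `tolerance`."""
    return new_accuracy >= prior_accuracy - tolerance


def no_regression(prior: Dict[str, bool], new: Dict[str, bool]) -> bool:
    """Strict: every row the prior run passed must also pass in the new run."""
    return all(new.get(row_id, False) for row_id, ok in prior.items() if ok)


# Hypothetical runs: the new version fixes r3 but breaks r2.
PRIOR = {"r1": True, "r2": True, "r3": False}
NEW = {"r1": True, "r2": False, "r3": True}

prior_accuracy = sum(PRIOR.values()) / len(PRIOR)
new_accuracy = sum(NEW.values()) / len(NEW)

print(within_tolerance(new_accuracy, prior_accuracy))  # True: accuracy is unchanged
print(no_regression(PRIOR, NEW))                       # False: r2 regressed
```

The same pair of runs passes the aggregate check and fails the row-level one, which is the trade-off to weigh when choosing the bar.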

What the workflow file does

The CI workflow is thin. It runs three steps:

Set up — install the SDK, set the Arize AX API key.
Run the experiment script — same script you’d run locally.
Report — the script’s exit code becomes the check’s pass/fail.

GitHub Actions is the most common shape, but the same pattern works for GitLab CI, Jenkins, Harness, and Azure DevOps. The specifics live in the per-system docs.
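The workflow file itself is CI-system configuration, but the logic it delegates to can be sketched as a thin Python entry point. This is an illustration only: the environment variable name `ARIZE_API_KEY` is an assumption, the `run_experiment` callable is a hypothetical hook for your own experiment script, and the exit code 2 for a setup failure is just a convention chosen here.

```python
from typing import Callable, List, Mapping

# Assumed name for the secret the CI system injects; check your own setup.
REQUIRED_SETTINGS = ["ARIZE_API_KEY"]


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Set up: list required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def ci_exit_code(env: Mapping[str, str], run_experiment: Callable[[], bool]) -> int:
    """Run the three steps and return the exit code the CI check will see."""
    missing = missing_settings(env)
    if missing:
        print(f"missing settings: {', '.join(missing)}")
        return 2  # setup failure, distinct from a failed predicate
    passed = run_experiment()  # run: the same script you would run locally
    print("predicate passed" if passed else "predicate failed")
    return 0 if passed else 1  # report: non-zero fails the check
```

A real entry point might finish with `raise SystemExit(ci_exit_code(os.environ, my_experiment))`, where `my_experiment` is your own function that returns whether the predicate held.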

What you get

Regression detection at PR time. A prompt change that hurts scores fails the check; you see it in the PR before merging.
Confident prompt edits. Reviewers know the change passed the eval bar before approving.
Auditable history. Every prompt version that shipped passed a specific eval against a specific dataset. The CI run record is the audit trail.
Model-swap safety. Update the provider or model in the Prompt Object, push the change, and CI tells you whether the new model still meets the bar against your dataset.

The combination of immutable versions, comparable experiments, and a CI predicate is what makes prompt changes safe at the same level of rigor as code changes.

Next step

The iteration loop is fast and safe. The final page covers what to reach for when you want the iteration done for you.

Next: Optimizing Prompts