The Arize AX Airflow Provider is the official Apache Airflow provider for Arize AX. It ships operators, sensors, and hooks so you can build DAGs that drive Arize from end to end without leaving Airflow. Common use cases:
  • Run continuous evaluations. Attach LLM judges to live projects and gate deployments on eval scores.
  • Manage the prompt lifecycle. Create, version, compare, label, and promote prompts in Prompt Hub.
  • Curate and refresh evaluation datasets. Append production examples, score dataset health, and evolve datasets as your traffic changes.
  • Run and compare experiments. Pre-computed or platform-executed, with regression and drift detection.
  • Export spans by trace, session, or time window into DataFrames or Parquet for downstream pipelines.
  • Orchestrate human-in-the-loop workflows. Annotation queues, agreement metrics, and feedback loops into evaluators.

Apache Airflow Example DAGs (GitHub)

Architecture

The provider sits between Airflow and the Arize AX platform. Operators wrap the Arize Python SDK, so every DAG task ends up as a typed call against the same APIs the SDK exposes. Production LLM systems continue to send OpenInference traces to Arize directly; the DAGs you build with this provider read and act on that data.
Figure: End-to-end architecture of the Arize AX Airflow Provider. Data engineers, LLMOps engineers, and AI platform teams author DAGs in Apache Airflow. The Arize AX Airflow Provider exposes operators, sensors, hooks, and example DAGs for datasets, experiments, projects, spans, evaluators, tasks, prompts, annotations, AI integrations, API keys, spaces, and ML. The provider calls the Arize Python SDK ArizeClient over REST, gRPC, and Arrow Flight. The Arize AX platform provides tracing, evaluation, datasets, prompts, the Eval Hub, the Prompt Hub, and annotation queues. Production LLM systems including LLM applications, AI agents, and RAG systems send OpenInference traces directly to Arize AX.

Prerequisites

  • Apache Airflow 2.4+
  • Python 3.10+
  • An Arize AX account
  • An Arize AX API key with read/write permissions for the resources your DAGs manage
The provider depends on the Arize Python SDK v8 and pins a tested version.

Launch Arize

  1. Sign in to your Arize AX account.
  2. From Space Settings, copy your Space ID and API Key. In the Airflow connection below, the API key goes in the Password field and the Space ID goes in default_space.

Install

pip install arize-ax-airflow-provider

Configure credentials

All operators authenticate through a single Airflow connection. The default connection ID is arize_ax_default.
1. Add an Airflow connection

In Airflow, create a new connection:
| Field | Value |
| --- | --- |
| Connection Id | arize_ax_default |
| Connection Type | Arize AX (arize_ax) |
| Password | Your Arize AX API key |
| Host | api.arize.com (or your regional endpoint, for example api.us-central-1a.arize.com) |
| Extra | The JSON below. Sets a default Space ID so operators don’t need to repeat it. |
Extra
{
  "default_space": "your-space-id",
  "region": "US_CENTRAL_1A",
  "api_scheme": "https"
}
The provider also exposes custom UI fields for space_id, region, and api_scheme, so you can fill them in without editing the Extra blob.
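If you would rather define the connection outside the UI, Airflow 2.3 and later can also read a connection from an environment variable named AIRFLOW_CONN_<CONN_ID> whose value is a JSON document. The sketch below builds that value from the connection fields above. The API key and Space ID are placeholders, and the sketch assumes the provider reads password, host, and extra from an environment-defined connection the same way it does from a UI-created one.

```python
import json

# Placeholder values: substitute your own API key and Space ID.
connection = {
    "conn_type": "arize_ax",
    "password": "your-api-key",
    "host": "api.us-central-1a.arize.com",
    "extra": {
        "default_space": "your-space-id",
        "region": "US_CENTRAL_1A",
        "api_scheme": "https",
    },
}

# Airflow resolves the connection ID "arize_ax_default" from this variable name.
env_name = "AIRFLOW_CONN_" + "arize_ax_default".upper()
env_value = json.dumps(connection)

# Print a shell line you could paste into the scheduler and worker environment.
print(f"export {env_name}='{env_value}'")
```

Keeping the API key in an environment variable or a secrets backend avoids storing it in the Airflow metadata database.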
2. (Optional) Set Airflow Variables

Operators resolve space_id in this order: the operator argument, then the connection’s extra.default_space, then the Airflow Variable arize_ax_space_id. If you set arize_ax_space_id as a Variable, you can template space_id="{{ var.value.get('arize_ax_space_id', None) }}" once and reuse it across DAGs.
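The precedence order can be pictured as a first-match lookup. The function below is a minimal sketch of that order, not the provider's implementation: resolve_space_id and its plain-dict arguments are hypothetical stand-ins for the operator argument, the connection's Extra JSON, and Airflow Variables.

```python
from typing import Mapping, Optional

VARIABLE_NAME = "arize_ax_space_id"


def resolve_space_id(
    operator_arg: Optional[str],
    connection_extra: Mapping[str, str],
    variables: Mapping[str, str],
) -> str:
    """Return the first space ID found, checking sources in precedence order."""
    candidates = (
        operator_arg,                             # 1. explicit operator argument
        connection_extra.get("default_space"),    # 2. connection Extra JSON
        variables.get(VARIABLE_NAME),             # 3. Airflow Variable
    )
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError(
        "space_id is not set on the operator, the connection extra, or the Variable"
    )


extra = {"default_space": "space-from-connection"}
variables = {VARIABLE_NAME: "space-from-variable"}

from_operator = resolve_space_id("space-from-operator", extra, variables)
from_connection = resolve_space_id(None, extra, variables)
from_variable = resolve_space_id(None, {}, variables)
```

An explicit operator argument always wins, which lets a single DAG target a different space without touching the shared connection.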

Run a DAG

This DAG creates a dataset and lists its examples. Both operators are idempotent, so you can re-run it safely.
from datetime import datetime
from airflow import DAG
from airflow.providers.arize_ax.operators.datasets import (
    ArizeAxCreateDatasetOperator,
    ArizeAxListDatasetExamplesOperator,
)

with DAG(
    dag_id="arize_ax_quickstart",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["arize_ax", "quickstart"],
) as dag:
    create_dataset = ArizeAxCreateDatasetOperator(
        task_id="create_dataset",
        space_id="{{ var.value.get('arize_ax_space_id', None) }}",
        name="quickstart-dataset",
        examples=[
            {"input": "What is 2+2?", "expected_output": "4"},
            {"input": "Capital of France?", "expected_output": "Paris"},
        ],
        if_exists="skip",
    )

    list_examples = ArizeAxListDatasetExamplesOperator(
        task_id="list_examples",
        dataset_id="{{ ti.xcom_pull(task_ids='create_dataset') }}",
    )

    create_dataset >> list_examples

Expected output

[create_dataset] Creating dataset 'quickstart-dataset' in space <your-space-id>
[create_dataset] Created dataset id=ds_01HXYZ1234ABCDEF (2 examples)
[create_dataset] Returning dataset_id to XCom: ds_01HXYZ1234ABCDEF
[list_examples] Fetching examples for dataset ds_01HXYZ1234ABCDEF
[list_examples] {"items": [...], "count": 2, "next_cursor": null}

Verify in Arize

  1. Open your Arize AX space and navigate to Datasets.
  2. You should see the quickstart-dataset dataset with 2 examples within ~30 seconds of the DAG completing.
  3. If the dataset doesn’t appear, see Troubleshooting.

Troubleshooting

  • Connection arize_ax_default not found. Confirm the Airflow connection was created with that exact ID, or override arize_ax_conn_id on each operator.
  • 401 Unauthorized from Arize. The API key is missing, revoked, or doesn’t have the resource permission the operator requires. Regenerate the key in Space Settings → API Keys.
  • Tasks succeed but the resource doesn’t appear in Arize. The connection is pointing at the wrong region. Match the Host field (for example api.us-central-1a.arize.com) to the region of the space you copied the API key from.
  • space_id resolution error. Operators resolve space_id from the operator argument first, then the connection’s extra.default_space, then the arize_ax_space_id Airflow Variable. Set one of these.
  • Sensor times out. Either the upstream task never produced the resource the sensor is watching, or the timeout is shorter than the real latency. Increase timeout or check upstream task logs.

Design patterns

A handful of patterns show up across every example DAG. Applying them is most of what makes a DAG safe to re-run and easy to gate.

Idempotent creates

Every Create* operator accepts if_exists="skip" (the default is "fail"). On a 409 Conflict, the operator looks up the existing resource by name and returns its ID, so a re-run doesn’t error out and doesn’t require manual cleanup.
ArizeAxCreateDatasetOperator(
    task_id="create_dataset",
    space_id="{{ var.value.get('arize_ax_space_id') }}",
    name="prod-eval-dataset",
    if_exists="skip",
)
Delete* operators take the counterpart flag, ignore_if_missing, which defaults to True. A 404 becomes a no-op.
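The behavior described above can be sketched with an in-memory stand-in for the API. Everything in this block is hypothetical (FakeDatasetClient, Conflict, NotFound, and the two helper functions); it illustrates the create-or-look-up and delete-or-ignore pattern, not the provider's or the SDK's actual code.

```python
import itertools


class Conflict(Exception):
    """Stand-in for an HTTP 409 response."""


class NotFound(Exception):
    """Stand-in for an HTTP 404 response."""


class FakeDatasetClient:
    """In-memory stand-in for a dataset API; not the Arize SDK."""

    def __init__(self):
        self._ids_by_name = {}
        self._counter = itertools.count(1)

    def create(self, name):
        if name in self._ids_by_name:
            raise Conflict(name)
        self._ids_by_name[name] = f"ds_{next(self._counter)}"
        return self._ids_by_name[name]

    def get_id_by_name(self, name):
        return self._ids_by_name[name]

    def delete(self, name):
        if name not in self._ids_by_name:
            raise NotFound(name)
        del self._ids_by_name[name]


def create_dataset(client, name, if_exists="fail"):
    """Create a dataset; on conflict, return the existing ID when if_exists='skip'."""
    try:
        return client.create(name)
    except Conflict:
        if if_exists == "skip":
            return client.get_id_by_name(name)
        raise


def delete_dataset(client, name, ignore_if_missing=True):
    """Delete a dataset; treat a missing resource as a no-op by default."""
    try:
        client.delete(name)
    except NotFound:
        if not ignore_if_missing:
            raise


client = FakeDatasetClient()
first = create_dataset(client, "prod-eval-dataset", if_exists="skip")
second = create_dataset(client, "prod-eval-dataset", if_exists="skip")  # re-run
delete_dataset(client, "prod-eval-dataset")
delete_dataset(client, "prod-eval-dataset")  # second delete is a no-op
```

Both calls to create_dataset return the same ID, which is what lets a retried or re-triggered DAG run pass the same value downstream.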

Built-in eval gates

Comparison and scoring operators raise AirflowException when a quality bar isn’t met, and Airflow’s normal failure semantics block downstream tasks. You don’t need a ShortCircuitOperator.
| Operator | Gate parameter |
| --- | --- |
| ArizeAxGetExperimentScoreOperator | min_score=0.7 |
| ArizeAxCompareExperimentsOperator | fail_on_regression=True |
| ArizeAxDetectEvalDriftOperator | fail_on_drift=True |
| ArizeAxBehavioralRegressionOperator | fail_on_regression=True |
| ArizeAxEvaluatorCalibrationOperator | fail_on_poor_calibration=True |
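As a rough illustration of the gate mechanism, the sketch below raises AirflowException when a mean score falls below min_score. The gate_on_score function and the choice of a mean are assumptions made for illustration; the real operators compute and compare scores on the Arize side. The fallback class only exists so the sketch runs without Airflow installed.

```python
from statistics import mean

try:
    from airflow.exceptions import AirflowException
except ImportError:  # lets the sketch run without Airflow installed
    class AirflowException(Exception):
        pass


def gate_on_score(scores, min_score=0.7):
    """Return the mean score, or raise so Airflow marks the task as failed."""
    observed = mean(scores)
    if observed < min_score:
        raise AirflowException(
            f"mean score {observed:.3f} is below min_score={min_score}"
        )
    return observed


passing = gate_on_score([0.9, 0.8, 0.7], min_score=0.7)

try:
    gate_on_score([0.5, 0.6], min_score=0.7)
    blocked = False
except AirflowException:
    blocked = True  # downstream tasks would not run
```

Because the failure is an ordinary task failure, the default trigger rule (all_success) keeps every downstream task from running.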

XCom output shape

List operators push a normalized payload to XCom and expose convenience keys, so downstream tasks don’t have to unpack the response.
{"items": [...], "count": 3, "next_cursor": "..."}
# Convenience XCom keys: "first_id", "first_name"
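# List datasets; the operator pushes the payload plus the convenience keys.
list_datasets = ArizeAxListDatasetsOperator(
    task_id="list_datasets",
    space_id="{{ ... }}",
)
# Fetch the first dataset by pulling its ID from the "first_id" XCom key.
get_first = ArizeAxGetDatasetOperator(
    task_id="get_first",
    dataset_id="{{ ti.xcom_pull(task_ids='list_datasets', key='first_id') }}",
)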
Create* and RunExperiment operators return the new resource ID as a scalar, so you can pull it directly with ti.xcom_pull(task_ids='...').
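To make the relationship between the payload and the convenience keys concrete, here is a minimal sketch of how such a payload could be assembled. normalize_list_response is a hypothetical helper, and the assumption that each item carries "id" and "name" fields is mine; "return_value" is the key Airflow uses for an operator's default XCom.

```python
def normalize_list_response(items, next_cursor=None):
    """Build the normalized payload and the convenience XCom values."""
    items = list(items)
    payload = {"items": items, "count": len(items), "next_cursor": next_cursor}
    first = items[0] if items else {}
    return {
        "return_value": payload,          # default XCom key
        "first_id": first.get("id"),      # None when the list is empty
        "first_name": first.get("name"),
    }


xcom = normalize_list_response(
    [
        {"id": "ds_1", "name": "prod-eval-dataset"},
        {"id": "ds_2", "name": "quickstart-dataset"},
    ]
)
empty = normalize_list_response([])
```

A downstream task that only needs one ID can pull first_id directly instead of parsing the full items list.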

Wait with sensors, not sleep

Use sensors to gate downstream work on real Arize state instead of fixed delays:
  • ArizeAxExperimentRunCountSensor blocks until N runs complete.
  • ArizeAxEvaluationScoreSensor blocks until a metric mean clears a threshold.
  • ArizeAxSpanCountSensor blocks until ingestion produces enough spans to evaluate.
  • ArizeAxTaskRunSensor blocks until an evaluation task run reaches a terminal state.
Every sensor accepts the standard Airflow poke_interval, timeout, mode, and soft_fail.
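The poke_interval, timeout, and soft_fail parameters interact roughly as in the loop below. This is a simplified sketch of poke-mode sensor semantics, not Airflow's or the provider's code: wait_until and SensorTimeout are hypothetical names, and returning False stands in for the skipped state that soft_fail produces.

```python
import time


class SensorTimeout(Exception):
    """Stand-in for the timeout failure a sensor raises."""


def wait_until(poke, poke_interval=15.0, timeout=600.0, soft_fail=False,
               clock=time.monotonic, sleep=time.sleep):
    """Call poke() until it returns True or the timeout elapses.

    Returns True on success and False when soft_fail absorbs a timeout.
    """
    started = clock()
    while True:
        if poke():
            return True
        if clock() - started > timeout:
            if soft_fail:
                return False  # a real sensor would mark the task as skipped
            raise SensorTimeout(f"condition not met within {timeout} seconds")
        sleep(poke_interval)


# Simulated run counts observed on successive pokes; sleeping is disabled
# so the example finishes immediately.
run_counts = iter([0, 2, 5])
done = wait_until(
    lambda: next(run_counts) >= 5,
    poke_interval=15,
    timeout=600,
    sleep=lambda seconds: None,
)
```

The condition is checked against observed state on every poke, so the task proceeds as soon as the work is done rather than after a fixed worst-case delay.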

Templating with Jinja

Every operator declares template_fields for its runtime-resolved parameters: space_id, project_name, dataset_id, start_time, end_time, and the like. Use Jinja to pull from Variables, XCom, or the DAG context:
space_id="{{ var.value.get('arize_ax_space_id', None) }}"
start_time="{{ data_interval_start }}"
end_time="{{ data_interval_end }}"
dataset_id="{{ ti.xcom_pull(task_ids='create_dataset') }}"

Re-evaluate existing spans

ArizeAxTriggerTaskRunOperator accepts override_evaluations=True. When you change an evaluator template, you can re-score the same time window without manually clearing spans first.

Example DAGs

The provider ships a library of example DAGs covering common LLMOps patterns end to end. Three representative ones are below.

LLM CI/CD gate

Run a candidate experiment, score it against a stored baseline, and fail the DAG if the candidate doesn’t beat the baseline by a configured threshold. The failure blocks the downstream promotion task.
from airflow.providers.arize_ax.operators.experiments import (
    ArizeAxCompareExperimentsOperator,
    ArizeAxGetExperimentScoreOperator,
    ArizeAxRunExperimentOperator,
)
from airflow.providers.arize_ax.sensors.arize_ax import (
    ArizeAxExperimentRunCountSensor,
)

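# my_llm_task, accuracy_evaluator, and promote_task are assumed to be
# defined elsewhere in your DAG file; they are not shown in this excerpt.
run_candidate = ArizeAxRunExperimentOperator(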
    task_id="run_candidate",
    name="cicd-candidate-{{ ts_nodash }}",
    dataset_id="{{ ti.xcom_pull(task_ids='create_dataset') }}",
    task=my_llm_task,
    evaluators=[accuracy_evaluator],
    concurrency=4,
)

wait_for_runs = ArizeAxExperimentRunCountSensor(
    task_id="wait_for_runs",
    experiment_id="{{ ti.xcom_pull(task_ids='run_candidate') }}",
    min_runs=5,
    poke_interval=15,
    timeout=600,
)

compare = ArizeAxCompareExperimentsOperator(
    task_id="compare",
    candidate_experiment_id="{{ ti.xcom_pull(task_ids='run_candidate') }}",
    baseline_experiment_id=(
        "{{ var.value.get('arize_ax_baseline_experiment_id') }}"
    ),
    pass_threshold=0.0,
    aggregation="mean",
    fail_on_regression=True,
)

run_candidate >> wait_for_runs >> compare >> promote_task
Full DAG: example_arize_ax_llm_cicd_gate_dag.py.
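The comparison step can be approximated as follows. The exact rule the operator applies to pass_threshold is not spelled out here, so this sketch assumes the candidate passes when its mean score exceeds the baseline mean by at least pass_threshold. compare_experiments is a hypothetical function, and RuntimeError stands in for the task failure.

```python
from statistics import mean


def compare_experiments(candidate_scores, baseline_scores,
                        pass_threshold=0.0, fail_on_regression=True):
    """Compare mean scores and optionally fail when the candidate regresses."""
    delta = mean(candidate_scores) - mean(baseline_scores)
    regressed = delta < pass_threshold
    if regressed and fail_on_regression:
        raise RuntimeError(
            f"candidate regressed: delta={delta:.3f}, required>={pass_threshold}"
        )
    return {"delta": delta, "regressed": regressed}


# Candidate improves on the baseline, so the gate passes.
improved = compare_experiments([0.9, 0.8], [0.7, 0.8])

# Candidate is worse; with fail_on_regression=False the result is only reported.
reported = compare_experiments([0.6, 0.7], [0.7, 0.8], fail_on_regression=False)
```

Setting fail_on_regression=False is useful while tuning a threshold, since the DAG records the delta without blocking promotion.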

Prompt lifecycle

Pull a prompt by ID, run a staging evaluation, score it, label it staging, run a production-environment evaluation, compare against the live production version, and promote on pass. Key operators: ArizeAxGetPromptOperator, ArizeAxRunExperimentOperator, ArizeAxGetExperimentScoreOperator, ArizeAxPromotePromptOperator, ArizeAxCompareExperimentsOperator. Full DAG: example_arize_ax_prompt_lifecycle_dag.py.

Dataset curation from production spans

On a daily schedule, export production spans that match a quality filter, deduplicate them, and append them to a long-running evaluation dataset. Key operators: ArizeAxSpansExportToDataframeOperator, ArizeAxCurateSpansToDatasetOperator, ArizeAxAppendDatasetExamplesOperator. Pair them with ArizeAxEvalDatasetHealthOperator to track freshness, diversity, and coverage drift. Full DAG: example_arize_ax_dataset_curation_dag.py.
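The filter, deduplicate, and append steps amount to something like the sketch below. The span fields (input, output, eval_score), the min_score filter, and the normalized-input deduplication key are all assumptions chosen for illustration; the curation operators define their own filters and matching rules.

```python
def curate_examples(spans, existing_inputs, min_score=0.8):
    """Return new dataset examples from spans that pass the quality filter.

    Spans are dropped when their score is too low or when their input
    already exists in the dataset or earlier in the same batch.
    """
    seen = {text.strip().lower() for text in existing_inputs}
    new_examples = []
    for span in spans:
        if span["eval_score"] < min_score:
            continue  # fails the quality filter
        key = span["input"].strip().lower()
        if key in seen:
            continue  # duplicate of an existing or already-selected example
        seen.add(key)
        new_examples.append(
            {"input": span["input"], "expected_output": span["output"]}
        )
    return new_examples


spans = [
    {"input": "Capital of France?", "output": "Paris", "eval_score": 0.95},
    {"input": "capital of france? ", "output": "Paris", "eval_score": 0.90},
    {"input": "What is 2+2?", "output": "4", "eval_score": 0.99},
    {"input": "Largest ocean?", "output": "Atlantic", "eval_score": 0.20},
]
to_append = curate_examples(spans, existing_inputs=["What is 2+2?"])
```

Deduplicating against the existing dataset as well as within the batch keeps a daily schedule from appending the same example repeatedly.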

More patterns

Other DAGs cover drift detection with auto-rollback, behavioral regression checks, prompt A/B testing, RAG evaluation, fine-tuning data export, and HITL annotation queues. Browse them in airflow_example_dags/.

Resources

Operator reference

Every operator, sensor, and hook, grouped by domain.

Example DAGs

Runnable DAGs covering CI/CD gating, prompt lifecycle, dataset curation, and more.

Apache Airflow

Apache Airflow documentation and project homepage.

Python SDK v8

The underlying ArizeClient API the provider wraps.