Use this page when Kubernetes is already running and is managed outside GKE, EKS, AKS, or OpenShift. This includes Rancher-managed clusters, Talos clusters, and other bare-metal or private Kubernetes environments. Your platform team owns the cluster-level choices: storage classes, ingress, DNS, registry access, and object storage. The Arize AX install flow is still the same: point kubectl at the target cluster, create values.yaml next to arize.sh, then run ./arize.sh install from the extracted distribution directory.

Before you start

You need:
  • A working kubeconfig for the target cluster.
  • Block-storage-backed persistent volumes. Do not use NFS-backed volumes for Arize AX persistent volumes.
  • Object storage for Gazette and ArizeDB data. For bare-metal installs this is usually MinIO, Ceph, or another S3-compatible service.
  • A storage class name for standard persistent volumes and SSD-style persistent volumes. They can be the same storage class if the cluster does not split storage tiers.
  • The application URL you plan to expose Arize AX at, for appBaseUrl. DNS does not need to resolve at install time; you can wire DNS and ingress afterwards. Pick the URL you intend to keep, though: OAuth callbacks, application redirects, and the configmap rendered by the operator all reference appBaseUrl, so changing it later means re-rendering the install configuration.
  • The release version from On-Premise Releases, plus distribution access, organization name, and sizing profile from Arize AI.
Use dedicated namespaces for Arize AX. The examples use arize and arize-operator; if you choose different names, keep them dedicated to this Arize AX install. Do not install Arize AX into a namespace shared with other applications. Cleanup and reinstall commands can delete namespace resources and are not safe for shared namespaces.
If you are using Rancher or Talos, confirm which kubeconfig cluster entry the installer should use:
kubectl config get-clusters
kubectl config current-context
Set clusterName to the cluster entry from kubectl config get-clusters. Do not assume it is always the same as the current context name.
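The check can be scripted before you write clusterName into values.yaml. This is a minimal sketch that assumes the usual kubectl config get-clusters output: a NAME header line followed by one cluster entry per line. The helper name cluster_entry_exists is hypothetical and is not part of arize.sh.

```shell
# cluster_entry_exists NAME  (hypothetical helper)
# Reads `kubectl config get-clusters` output on stdin and succeeds only if
# NAME matches one cluster entry exactly. The first line (the NAME header)
# is skipped so a value of "NAME" is not accepted by accident.
cluster_entry_exists() {
  tail -n +2 | grep -Fx -- "$1" >/dev/null
}

# Usage against your own kubeconfig (cluster name is a placeholder):
#   kubectl config get-clusters | cluster_entry_exists "my-cluster" \
#     && echo "cluster entry found" \
#     || echo "no such cluster entry; do not use this as clusterName"
```

Exact matching is deliberate: a context name that only resembles the cluster entry is rejected.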

Decide whether this is an upgrade or a reset

If Arize AX is already installed, running ./arize.sh install again is the normal path for a refresh, a redeploy of operator-managed manifests, or an upgrade with your current values.yaml.
Use Fresh reinstall cleanup only when you intentionally want to discard the existing install and reinstall from an empty target.
If a previous install failed partway through, check the operator, jobs, and pods before deciding whether to continue with ./arize.sh install or reset the target. A reset deletes in-cluster Arize AX data unless you have a backup/restore plan.

Choose the storage mode

For most Rancher and bare-metal installs, start with one of these values:
Environment | cloud value | Object storage
--- | --- | ---
Operator-managed MinIO in the cluster | minio | MinIO deployed with Arize AX
Existing Ceph or S3-compatible endpoint | ceph | External S3-compatible service
Use minio when you want the Arize AX install to manage MinIO in the cluster. Use ceph when your platform team already provides an S3-compatible endpoint and credentials.
With cloud: minio, the operator’s install-minio-init job creates gazetteBucket and druidBucket inside the operator-managed MinIO during the install. Pick names that are unique to your install (they only need to be unique within this MinIO); you do not pre-provision them.
With cloud: ceph or another S3-compatible endpoint, pre-provision both buckets in your platform’s storage and grant the install credentials read/write access.

Create values.yaml

Create values.yaml in the extracted distribution directory, next to arize.sh. This minimal example is for a bare-metal cluster using operator-managed MinIO:
helmNamespace: arize-operator
cloud: "minio"
clusterName: "<cluster-entry-from-kubectl-config-get-clusters>"
organizationName: "<organization-name>"
clusterSizing: "<sizing-profile>"

gazetteBucket: "<gazette-bucket-name>"
druidBucket: "<adb-bucket-name>"

hubJwt: "<base64-encoded-runtime-registry-jwt>"
postgresPassword: "<base64-encoded-postgres-password>"
cipherKey: "<base64-encoded-cipher-key>"
minioPassword: "<base64-encoded-minio-password>"

appBaseUrl: "https://<arize-app-domain>"

storageClassMinioStandard: "<standard-storage-class>"
storageClassMinioSsd: "<ssd-storage-class>"

# Set this to false if the cluster does not have a separate ArizeDB node pool.
historicalNodePoolEnabled: false

# Leave empty to use the Postgres deployed by the operator.
postgresHostEndpoint: ""
hubJwt is the runtime registry credential that Arize AI provides for pulling images from ch.hub.arize.com (or your mirror, if you set pullRegistry). It is not the same value as the JWT used for the one-time tarball download; Arize AI provides both, and arize.sh does not derive one from the other.
To point at a managed or customer-provided Postgres instance instead, set postgresHostEndpoint to its hostname and follow External Postgres requirements for supported versions, sizing, parameters, and database initialization.
For smaller bare-metal clusters, ask Arize AI which sizing profile to use. Do not use small1b or medium2b unless the nodes match the sizing requirements in Cluster sizing.
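Before running the install, it can help to confirm that no angle-bracket placeholder from the example is still present in values.yaml. The sketch below is an illustrative check, not something arize.sh does for you; find_placeholders is a hypothetical helper name, and it assumes none of your real values contain a <...> sequence.

```shell
# find_placeholders FILE  (hypothetical helper)
# Prints every non-comment line of FILE that still contains a <placeholder>,
# prefixed with its line number. Exits 0 when at least one such line exists
# and non-zero when the file is clean.
find_placeholders() {
  grep -n '<[^<>]*>' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}

# Usage:
#   if find_placeholders values.yaml; then
#     echo "Replace the placeholders listed above before installing."
#   fi
```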

Small cluster values

Small Rancher, Talos, or homelab clusters often need extra values beyond the minimal example. These are not universal production defaults, but they are common when the cluster has a shared node pool or less memory than the standard sizing profiles expect.
Before using operator-managed MinIO on a small cluster, confirm the storage budget for your release. Some releases create eight 150 Gi PVCs for MinIO alone (two per replica), before the other Arize AX volumes are created. Longhorn or another replicated storage backend may need more physical disk than the PVC total. If that does not fit comfortably, use an external S3-compatible store with cloud: ceph, add storage, or work with Arize AI on a smaller storage plan before installing.
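As a worked example of that budget, eight 150 Gi PVCs request 1200 Gi in total. The sketch below multiplies that figure by a storage replication factor. The factor of 3 in the second call is an assumption (a commonly used Longhorn replica count), so substitute the setting your backend actually uses. The result covers the MinIO PVCs only, not the other Arize AX volumes.

```shell
# minio_physical_gi PVC_COUNT PVC_SIZE_GI REPLICATION_FACTOR  (hypothetical helper)
# Prints the physical disk, in Gi, needed to back the MinIO PVCs alone.
minio_physical_gi() {
  echo $(( $1 * $2 * $3 ))
}

minio_physical_gi 8 150 1   # requested capacity with no backend replication
minio_physical_gi 8 150 3   # same PVCs on a backend that keeps 3 copies (assumed)
```

The two calls print 1200 and 3600, which shows how quickly a replicated backend outgrows the PVC total.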
Discuss these with Arize AI before using them in production:
# Use a shared node pool instead of a dedicated ArizeDB pool.
historicalNodePoolEnabled: false

# Use node emptyDir for component scratch space instead of large default PVCs.
ephemeralMode: emptyDir

# Disable autosizing on small clusters: with autosizing on, the ArizeDB
# historical container can request 24 GiB of memory, which will not fit on
# a typical homelab node and the pod will stay Pending.
autoSizeMemory: false
autoSizeReplicas: false
Some small clusters also need ArizeDB historical JVM/resource tuning, toleration changes, or baseOverlay patches. Treat those as environment-specific overrides, not copy/paste defaults.
If your nodes are tainted, make the Arize AX tolerations match the taints your platform uses. The toleration values are strings, so keep the list quoted:
podTolerationAll: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationBase: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationHist: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
# Only set this if legacy Spark is enabled.
podTolerationSpark: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
baseOverlay is a multiline YAML patch that the operator applies to Arize AX application manifests. Use it for targeted Arize AI-reviewed changes, such as changing a replica count or a container resource request. Paste it under baseOverlay: | exactly as provided:
baseOverlay: |
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: druid-compaction
    namespace: arize
  spec:
    replicas: 1
Do not put volumeClaimTemplates inside a baseOverlay. Kubernetes does not allow an existing StatefulSet’s volumeClaimTemplates to be changed. If an overlay attempts such a change, the operator can stop reconciling with a Forbidden error. To use different PVC sizes, set the chart value before the first install when one is available. If the StatefulSet already exists, work with Arize AI on a cleanup and recreate plan for that component.

Encoding values

The secret fields in the example must be base64-encoded. Encode a short value with:
printf '%s' '<value>' | base64 | tr -d '\n'
Do not base64 encode clusterName, organizationName, bucket names, storage class names, registry hostnames, or appBaseUrl.
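A round trip is a quick way to confirm that an encoded value decodes back to exactly what you intended. This sketch wraps the command above in a helper and decodes the result again. encode_secret and decode_secret are hypothetical names, and the example password is a placeholder. It assumes a base64 that accepts --decode, as GNU coreutils and macOS do.

```shell
# encode_secret VALUE  (hypothetical helper)
# Base64-encodes VALUE. printf '%s' avoids adding a trailing newline to the
# input, and tr strips any line wrapping from the output.
encode_secret() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# decode_secret ENCODED  (hypothetical helper)
# Reverses encode_secret so the round trip can be checked.
decode_secret() {
  printf '%s' "$1" | base64 --decode
}

encoded=$(encode_secret 'example-password')
[ "$(decode_secret "$encoded")" = 'example-password' ] && echo "round trip ok"
```

Using echo instead of printf '%s' would append a newline to the input, and that newline would become part of the encoded secret.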

Run the install

Run commands from the extracted distribution directory:
./arize.sh -y -t 5400 install
Use a longer timeout on smaller clusters and on clusters that pull images from an external registry.
Some distributions install the operator chart without creating the arize-operator namespace first, which causes arize.sh install to fail with:
namespaces "arize-operator" not found
Do not fix this by creating arize-operator with a plain kubectl create namespace command. Helm needs to own the namespace; a plain kubectl create namespace lacks the ownership annotations Helm expects, and a later upgrade can fail.
Workaround: in your local copy of the distribution, edit arize.sh so the operator-chart helm upgrade --install command passes --create-namespace. The two lines to edit are the operator-chart installs (search for arize-op in arize.sh); add --create-namespace after helm upgrade --install. Re-run ./arize.sh -y -t 5400 install. Helm creates the namespace with proper ownership metadata, so subsequent upgrades work.
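To find the lines to edit, and to check the edit afterwards, a search like the following may help. It is a sketch only: list_operator_installs is a hypothetical helper, and it assumes each helm command sits on a single line. If arize.sh splits the command across continuation lines, read the surrounding lines by hand. The search string arize-op also matches arize-operator, so expect some unrelated hits.

```shell
# list_operator_installs FILE  (hypothetical helper)
# Prints each line of FILE that mentions arize-op, with its line number and
# a marker showing whether --create-namespace already appears on that line.
list_operator_installs() {
  grep -n 'arize-op' "$1" | while IFS= read -r line; do
    case "$line" in
      *--create-namespace*) echo "ok      $line" ;;
      *)                    echo "missing $line" ;;
    esac
  done
}

# Usage, from the extracted distribution directory:
#   list_operator_installs arize.sh
```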

Talos and PodSecurity notes

Talos and other bare-metal clusters often enforce the Kubernetes PodSecurity baseline policy, which rejects MinIO pods that bind hostPort: 9000. Check events when MinIO is not fully ready:
kubectl get events -n arize --sort-by=.lastTimestamp
kubectl get sts minio -n arize
Two event messages usually point to this issue:
  • violates PodSecurity "baseline" and mentions hostPort
  • didn't have free ports for the requested pod ports
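On a busy namespace, those two messages can be picked out of the event list with a small filter. This is a sketch that assumes the message wording shown above; minio_hostport_events is a hypothetical helper name. PodSecurity violations that do not mention hostPort are skipped on purpose, because they point to a different problem.

```shell
# minio_hostport_events  (hypothetical helper)
# Reads `kubectl get events` output on stdin and prints only the lines that
# look like the hostPort problem: a PodSecurity baseline violation that
# mentions hostPort, or the scheduler's free-ports message.
minio_hostport_events() {
  awk '/violates PodSecurity "baseline/ && /hostPort/ { print; next }
       /didn.t have free ports for the requested pod ports/ { print }'
}

# Usage:
#   kubectl get events -n arize --sort-by=.lastTimestamp | minio_hostport_events
```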
The scheduling message can show up even after PodSecurity is relaxed. With hostPort: 9000 still set, Kubernetes can place only one MinIO pod per node, so a two-node cluster cannot run all four MinIO replicas until hostPort is removed.
The on-prem MinIO StatefulSet has two containers (mc and minio), and their order is not guaranteed across releases. Remove hostPort from whichever container is named minio:
IDX=$(kubectl get sts minio -n arize \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' \
  | grep -n '^minio$' | head -1 | cut -d: -f1)
if [ -z "$IDX" ]; then
  echo "MinIO container not found; contact Arize for an updated patch."
else
  IDX=$((IDX - 1))
  kubectl patch sts -n arize minio --type='json' -p="[
    {\"op\":\"test\",\"path\":\"/spec/template/spec/containers/$IDX/name\",\"value\":\"minio\"},
    {\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$IDX/ports/0/hostPort\"}
  ]"
  kubectl rollout restart statefulset/minio -n arize || true
fi
The test operation guards against the StatefulSet layout changing in future releases. If the patch errors out, the StatefulSet is unchanged; contact Arize AI for an updated patch.
After patching MinIO, continue checking the install status. Some releases re-render the MinIO StatefulSet and reintroduce hostPort, so repeat the patch if MinIO drops back to 3/4 and events mention requested pod ports. MinIO is a four-replica distributed StatefulSet, and the cluster cannot serve API requests or create buckets until all four replicas are Ready.
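The "drops back to 3/4" condition can be checked in a script rather than by eye. The sketch below compares the ready and desired replica counts. sts_fully_ready is a hypothetical helper, and the jsonpath expression in the usage comment reads the standard StatefulSet fields status.readyReplicas and spec.replicas.

```shell
# sts_fully_ready "READY/DESIRED"  (hypothetical helper)
# Succeeds only when the ready count equals the desired count.
# An empty ready count (no replica ready yet) is treated as 0.
sts_fully_ready() {
  ready=${1%%/*}
  desired=${1##*/}
  [ "${ready:-0}" -eq "$desired" ]
}

# Usage against a live cluster:
#   status=$(kubectl get sts minio -n arize \
#     -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')
#   sts_fully_ready "$status" \
#     || echo "MinIO is at $status; check events and re-apply the patch if hostPort is back"
```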

Promtail and PodSecurity baseline

The Arize AX chart deploys a promtail DaemonSet that mounts node hostPath volumes (/var/log/pods, /var/lib/docker/containers, and similar) to ship pod logs to Loki. The PodSecurity baseline policy rejects hostPath volumes, so on Talos, kyverno-enforced clusters, and other restricted-PSA environments no promtail pod can schedule. Look for promtail in kubectl get events -n arize:
pods "promtail-xxxxx" is forbidden: violates PodSecurity "baseline:latest":
  hostPath volumes (volumes "run", "containers", "pods")
Promtail is not in the install-blocking path. The rest of Arize AX installs and runs without it, but in-cluster log shipping to Loki will be missing until you address the policy. The simplest fix is to relax PSA enforcement for the arize namespace:
kubectl label ns arize pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl rollout restart daemonset/promtail -n arize
This relaxes PodSecurity for the arize namespace as a whole, not just promtail. If your platform team requires a tighter scope, use a kyverno PolicyException (or your policy engine’s equivalent) targeting only the promtail DaemonSet’s hostPath volumes instead.

Check the install

After arize.sh install finishes, do not rely only on the shell exit code. Confirm the operator and pods are healthy:
kubectl get pods -n arize-operator
kubectl get pod arize-op-arize-operator-operator-0 -n arize-operator \
  -o jsonpath='{.metadata.annotations.operator\.arize\.com/status}{"\n"}'
kubectl get pods -n arize
kubectl get jobs -n arize
./arize.sh -y install-status can also be useful when Arize AI Support asks for a deeper status check, but it prints the startup configuration, including secret values such as hubJwt, and it can produce a lot of output. Redact the output before sharing it.
Useful operator statuses:
Status | Meaning
--- | ---
Executing | The operator is rendering and applying manifests.
Installing | Core install jobs and dependencies are still starting.
Delayed | The operator is waiting for active install jobs to finish before it can reconcile. If a job is stuck Running for an unreasonable time, see Troubleshooting failed or stuck jobs.
Running | The operator reports the deployment is complete.
Error | Check operator logs and failed jobs.
A successful install should leave the Arize AX Operator running in arize-operator, all install jobs completed in arize, and application pods running.
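If you poll the operator status from a script, a small lookup keeps the follow-up action next to the status value. This sketch maps the statuses listed above to short action labels. next_step and the labels it prints are hypothetical and chosen only for illustration.

```shell
# next_step STATUS  (hypothetical helper; the output labels are illustrative)
# Maps an operator status string to a suggested follow-up.
next_step() {
  case "$1" in
    Executing|Installing) echo "wait" ;;
    Delayed)              echo "check-jobs" ;;
    Running)              echo "done" ;;
    Error)                echo "check-logs-and-jobs" ;;
    *)                    echo "unknown" ;;
  esac
}

# Usage against a live cluster:
#   status=$(kubectl get pod arize-op-arize-operator-operator-0 -n arize-operator \
#     -o jsonpath='{.metadata.annotations.operator\.arize\.com/status}')
#   next_step "$status"
```

Any status outside the listed values falls through to "unknown" instead of being treated as success.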

Fresh reinstall cleanup

This procedure removes an Arize AX install from any Kubernetes cluster, including managed cloud, OpenShift, Rancher, and bare metal, so you can reinstall from scratch.
Use this only when you are intentionally discarding the previous Arize AX install. The PVC deletion below removes in-cluster Arize AX data. Only run this when data loss is acceptable or you have a backup and restore plan.
Run from a workstation with kubectl and helm configured for the target cluster:
helm uninstall arize-op -n arize-operator || true

kubectl delete all,cm,secret,pvc,sa,role,rolebinding,cronjob,job \
  -n arize --all --ignore-not-found=true --wait=false || true
kubectl delete all,cm,secret,pvc,sa,role,rolebinding,job \
  -n arize-operator --all --ignore-not-found=true --wait=false || true

kubectl delete ns arize --wait=false || true
kubectl delete ns arize-operator --wait=false || true
kubectl wait --for=delete ns/arize --timeout=300s || true
kubectl wait --for=delete ns/arize-operator --timeout=300s || true
helm uninstall removes the operator chart’s cluster-scoped resources (such as the Arize AX Prometheus node ClusterRole and ClusterRoleBinding) automatically. The chart applies a helm.sh/resource-policy: keep annotation to both namespaces, so the explicit kubectl delete ns calls above are required.
Some Arize AX pods use a long graceful termination window. Gazette, for example, sets terminationGracePeriodSeconds: 1500 (25 minutes). If namespace deletion stays in Terminating for more than a few minutes, force-delete any stuck pods to unblock it:
kubectl get pods -n arize -o name \
  | xargs -r kubectl delete -n arize --grace-period=0 --force
If a namespace is still stuck after its pods are gone, remove the namespace finalizer. Use this as a last resort: it bypasses controllers that may still be reconciling resources.
kubectl get ns arize -o json \
  | jq 'del(.spec.finalizers)' \
  | kubectl replace --raw /api/v1/namespaces/arize/finalize -f -
Before reinstalling, verify both namespaces are gone:
kubectl get ns arize || true
kubectl get ns arize-operator || true
Both commands should report NotFound. If a namespace with the same name has already been recreated, verify that no leftover pods or resources exist in it before continuing.
If the cluster uses Longhorn, also check for old Arize AX volumes before reinstalling. Repeated test installs can leave detached Longhorn volumes even after Kubernetes namespaces and PVCs are gone, and those volumes still consume disk:
kubectl get volumes.longhorn.io -n longhorn-system
Delete only volumes that clearly belong to the previous Arize AX install and whose data you intend to discard. Do not delete volumes for other namespaces or other applications.
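To narrow the list to likely candidates, you can filter on the namespace each volume was last bound to and on its state. The sketch below only filters text and deletes nothing. arize_detached_volumes is a hypothetical helper. The field paths in the usage comment (.status.kubernetesStatus.namespace and .status.state) are assumptions about the Longhorn Volume resource, so confirm them with kubectl get volumes.longhorn.io -o yaml on your Longhorn version.

```shell
# arize_detached_volumes  (hypothetical helper)
# Reads whitespace-separated "NAME NAMESPACE STATE" rows on stdin and prints
# the NAME of each row whose namespace is arize or arize-operator and whose
# state is detached. It only lists candidates for manual review.
arize_detached_volumes() {
  awk '($2 == "arize" || $2 == "arize-operator") && $3 == "detached" { print $1 }'
}

# Usage (field paths are assumptions; verify them on your cluster first):
#   kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,NS:.status.kubernetesStatus.namespace,STATE:.status.state \
#     | arize_detached_volumes
```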

Troubleshooting failed or stuck jobs

If the operator reports Error, Delayed, or says bad jobs exist, check jobs and logs:
kubectl get jobs -n arize

# Print the last 120 log lines from the first pod of each init job, if one exists.
for job in install-postgres-init install-minio-init install-gazette-init install-druid-init; do
  echo "===== $job ====="
  pod=$(kubectl get pod -n arize -l "job-name=$job" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
  if [ -n "$pod" ]; then
    kubectl logs -n arize "$pod" --tail=120
  else
    echo "(no pod found for $job)"
  fi
done
A Failed job blocks retry. So does a Running job that has been waiting for hours on a dependency that has since recovered: the init pod will not reconnect, and the operator stays Delayed while it waits for that job to finish. In either case, delete the Arize AX-owned job and re-apply:
kubectl delete job <failed-or-stuck-job-name> -n arize || true
./arize.sh -y apply
The operator recreates fresh init jobs on the next reconcile. The most common stuck job is install-minio-init, which loops Waiting to create local alias... if MinIO was unhealthy at job start time.
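To see which jobs are candidates for deletion, you can list the jobs that have not succeeded yet. This is a sketch under stated assumptions: jobs_needing_attention is a hypothetical helper, and it reads the standard Job status fields succeeded and failed. It cannot tell a job that is stuck from one that is simply still running, so check the logs before deleting anything.

```shell
# jobs_needing_attention  (hypothetical helper)
# Reads tab-separated "NAME<TAB>SUCCEEDED<TAB>FAILED" rows on stdin and
# prints each job that has no successful completion, followed by
# "has-failed-pods" or "not-finished".
jobs_needing_attention() {
  awk -F '\t' '$1 != "" && ($2 + 0) == 0 {
    print $1, (($3 + 0) > 0 ? "has-failed-pods" : "not-finished")
  }'
}

# Usage against a live cluster:
#   kubectl get jobs -n arize \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.succeeded}{"\t"}{.status.failed}{"\n"}{end}' \
#     | jobs_needing_attention
```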

Log safety

arize.sh prints secret values such as hubJwt in the startup configuration block at the top of every install run. Redact install logs before sharing them with anyone outside your team.
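A first redaction pass can be automated. The sketch below masks the values of the four secret fields from the values.yaml example. redact_secrets is a hypothetical helper, and the log layout is an assumption: it expects each field name to be followed by some mix of colon, equals sign, quote, or space and then the value. Review the result by hand before sharing it, since the real log format may differ.

```shell
# redact_secrets  (hypothetical helper)
# Reads log text on stdin and replaces the value that follows each known
# secret field name with <redacted>. The key/value layout is an assumption.
redact_secrets() {
  sed -E 's/((hubJwt|postgresPassword|cipherKey|minioPassword)[":= ]+)[^[:space:]"]+/\1<redacted>/g'
}

# Usage (install.log is a placeholder for wherever you captured the output):
#   redact_secrets < install.log > install.redacted.log
```

The pattern errs toward masking too much: any word that directly follows one of the field names is replaced.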