Use this page when Kubernetes is already running and is managed outside GKE, EKS, AKS, or OpenShift. This includes Rancher-managed clusters, Talos clusters, and other bare-metal or private Kubernetes environments. Your platform team owns the cluster-level choices: storage classes, ingress, DNS, registry access, and object storage. The Arize AX install flow is still the same: point kubectl at the target cluster, create values.yaml next to arize.sh, then run ./arize.sh install from the extracted distribution directory.
Before you start
You need:
- A working kubeconfig for the target cluster.
- Block-storage-backed persistent volumes. Do not use NFS-backed volumes for Arize AX persistent volumes.
- Object storage for Gazette and ArizeDB data. For bare-metal installs this is usually MinIO, Ceph, or another S3-compatible service.
- A storage class name for standard persistent volumes and SSD-style persistent volumes. They can be the same storage class if the cluster does not split storage tiers.
- The application URL you plan to expose Arize AX at, for appBaseUrl. DNS does not need to resolve at install time; you can wire DNS and ingress afterwards. Pick the URL you intend to keep, though: OAuth callbacks, application redirects, and the configmap rendered by the operator all reference appBaseUrl, so changing it later means re-rendering the install configuration.
- The release version from On-Premise Releases, plus distribution access, organization name, and sizing profile from Arize AI.
Set clusterName to the cluster entry from kubectl config get-clusters. Do not assume it is always the same as the current context name.
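The difference between the cluster entry and the context name can be checked directly from the kubeconfig. The following is a minimal sketch, assuming kubectl is on the PATH and the current context points at the target cluster; the function names are illustrative and not part of arize.sh.

```shell
# Print the cluster entry that the current context refers to.
# --minify limits the view to the current context, so clusters[0] is its cluster.
current_cluster_entry() {
  kubectl config view --minify -o jsonpath='{.clusters[0].name}'
}

# Print the current context name for comparison; the two values can differ.
current_context_name() {
  kubectl config current-context
}
```

Calling current_cluster_entry prints the candidate value for clusterName, while kubectl config get-clusters lists every cluster entry in the kubeconfig.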
Decide whether this is an upgrade or a reset
If Arize AX is already installed, running ./arize.sh install again is the normal path for a refresh, a redeploy of operator-managed manifests, or an upgrade with your current values.yaml.
Use Fresh reinstall cleanup only when you intentionally want to discard the existing install and reinstall from an empty target.
Decide before you begin whether to re-run ./arize.sh install or reset the target. A reset deletes in-cluster Arize AX data unless you have a backup/restore plan.
Choose the storage mode
For most Rancher and bare-metal installs, start with one of these values:

| Environment | cloud value | Object storage |
|---|---|---|
| Operator-managed MinIO in the cluster | minio | MinIO deployed with Arize AX |
| Existing Ceph or S3-compatible endpoint | ceph | External S3-compatible service |

Use minio when you want the Arize AX install to manage MinIO in the cluster. Use ceph when your platform team already provides an S3-compatible endpoint and credentials.
With cloud: minio, the operator’s install-minio-init job creates gazetteBucket and druidBucket inside the operator-managed MinIO during the install. Pick names that are unique to your install (they only need to be unique within this MinIO); you do not pre-provision them. With cloud: ceph or another S3-compatible endpoint, pre-provision both buckets in your platform’s storage and grant the install credentials read/write access.
Create values.yaml
Create values.yaml in the extracted distribution directory, next to arize.sh.
This minimal example is for a bare-metal cluster using operator-managed MinIO:
hubJwt is the runtime registry credential that Arize AI provides for pulling images from ch.hub.arize.com (or your mirror, if you set pullRegistry). It is not the same value as the JWT used for the one-time tarball download; Arize AI provides both, and arize.sh does not derive one from the other.
To point at a managed or customer-provided Postgres instance instead, set postgresHostEndpoint to its hostname and follow External Postgres requirements for supported versions, sizing, parameters, and database initialization.
For smaller bare-metal clusters, ask Arize AI which sizing profile to use. Do not use small1b or medium2b unless the nodes match the sizing requirements in Cluster sizing.
Small cluster values
Small Rancher, Talos, or homelab clusters often need extra values beyond the minimal example. These are not universal production defaults, but they are common when the cluster has a shared node pool or less memory than the standard sizing profiles expect.

Before using operator-managed MinIO on a small cluster, confirm the storage budget for your release. Some releases create eight 150 Gi PVCs for MinIO alone (two per replica), before the other Arize AX volumes are created. Longhorn or another replicated storage backend may need more physical disk than the PVC total. If that does not fit comfortably, use an external S3-compatible store with cloud: ceph, add storage, or work with Arize AI on a smaller storage plan before installing.

Where these extra values include baseOverlay patches, treat those as environment-specific overrides, not copy/paste defaults.
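The MinIO storage budget mentioned above is simple arithmetic, sketched here for convenience. The PVC count and size come from the text; the replica count of 3 in the second call is only an assumption, so check what your storage backend is actually configured to use. The function name is illustrative.

```shell
# Estimate the disk (in Gi) needed for a set of equally sized PVCs.
# $1 = number of PVCs, $2 = size of each PVC in Gi,
# $3 = storage-backend replica count (assumed; defaults to 1, no replication).
pvc_disk_estimate_gi() {
  count="$1"
  size_gi="$2"
  replicas="${3:-1}"
  echo $(( count * size_gi * replicas ))
}

pvc_disk_estimate_gi 8 150      # PVC total for MinIO alone: 1200
pvc_disk_estimate_gi 8 150 3    # with an assumed 3 storage replicas: 3600
```

The estimate covers MinIO only; the other Arize AX volumes come on top of it.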
If your nodes are tainted, make the Arize AX tolerations match the taints your platform uses. The toleration values are strings, so keep the list quoted:
baseOverlay is a multiline YAML patch that the operator applies to Arize AX application manifests. Use it for targeted Arize AI-reviewed changes, such as changing a replica count or a container resource request. Paste it under baseOverlay: | exactly as provided:
Encoding values
The secret fields in the example must be base64-encoded. Encode each secret value before you paste it into values.yaml. Do not base64-encode clusterName, organizationName, bucket names, storage class names, registry hostnames, or appBaseUrl.
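As one way to produce the encoded values, the sketch below base64-encodes a string without the trailing newline that echo would otherwise add to it. It assumes a POSIX shell with the base64 utility available; the function name and the sample value are illustrative.

```shell
# Base64-encode a single value passed as the first argument.
encode_secret() {
  # printf '%s' avoids the trailing newline that echo appends;
  # tr strips the line wrapping that some base64 implementations add.
  printf '%s' "$1" | base64 | tr -d '\n'
}

encode_secret 'example-password'
```

Decoding the result with base64 -d (or base64 -D on some systems) is a quick way to confirm that the value round-trips before it goes into values.yaml.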
Run the install
Run commands from the extracted distribution directory.

If arize.sh install fails with an error about the arize-operator namespace, edit arize.sh so the operator-chart helm upgrade --install command passes --create-namespace. The two lines to edit are the operator-chart installs (search for arize-op in arize.sh); add --create-namespace after helm upgrade --install. Then re-run ./arize.sh -y -t 5400 install. Helm creates the namespace with proper ownership metadata, so subsequent upgrades work.
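To locate the lines to edit, a search like the following may help. This is a sketch only: it assumes each operator-chart install keeps helm upgrade --install and the arize-op name on the same line, which may not hold for every release of arize.sh, and the function name is illustrative.

```shell
# List the lines in arize.sh (or the file given as $1) where
# 'helm upgrade --install' and 'arize-op' appear together, and flag
# whether --create-namespace is already present on each of them.
find_operator_installs() {
  script="${1:-arize.sh}"
  grep -n 'helm upgrade --install' "$script" | grep 'arize-op' |
    while IFS= read -r line; do
      case "$line" in
        *--create-namespace*) echo "ok:      $line" ;;
        *)                    echo "missing: $line" ;;
      esac
    done
}
```

Running find_operator_installs from the extracted distribution directory prints each match with its line number; lines reported as missing are the candidates for the edit.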
Talos and PodSecurity notes
Talos and other bare-metal clusters often enforce the Kubernetes PodSecurity baseline policy, which rejects MinIO pods that bind hostPort: 9000. Check events when MinIO is not fully ready, and look for messages such as:
- violates PodSecurity "baseline" and mentions hostPort
- didn't have free ports for the requested pod ports

With hostPort: 9000 still set, Kubernetes can place only one MinIO pod per node. A two-node cluster cannot run all four MinIO replicas until hostPort is removed.
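The event check described above could be narrowed with a filter like this one. It is a sketch that assumes MinIO runs in the arize namespace, as the other commands on this page suggest; the function name is illustrative.

```shell
# Show events in the arize namespace, oldest first, that mention hostPort
# or the scheduler's "free ports" message.
minio_hostport_events() {
  kubectl get events -n arize --sort-by=.lastTimestamp |
    grep -E 'hostPort|free ports'
}
```

No output means no matching events were found; it does not by itself prove that MinIO is healthy.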
The on-prem MinIO StatefulSet has two containers (mc and minio), and their order is not guaranteed across releases. Remove hostPort from whichever container is named minio:
The test operation guards against the StatefulSet layout changing in future releases. If the patch errors out, the StatefulSet is unchanged; contact Arize AI for an updated patch.
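A JSON patch with such a guard might look like the sketch below. The container index, the port index, and the StatefulSet name are assumptions that vary by release and must be read from the live object first; the function name is illustrative, and this is not necessarily the patch Arize AI provides.

```shell
# Build a JSON patch that removes hostPort from the container at index $1,
# but only if that container is named "minio". The leading "test" operation
# makes the whole patch fail, and change nothing, if the name does not match.
# $2 is the index of the port entry that carries hostPort (assumed 0).
build_hostport_patch() {
  idx="$1"
  port_idx="${2:-0}"
  base="/spec/template/spec/containers/$idx"
  printf '[{"op":"test","path":"%s/name","value":"minio"},' "$base"
  printf '{"op":"remove","path":"%s/ports/%s/hostPort"}]\n' "$base" "$port_idx"
}
```

The result can be passed to kubectl patch statefulset <name> -n arize --type=json -p "$(build_hostport_patch 1)", where <name> stands for the MinIO StatefulSet in your install. To find the index, kubectl get statefulset <name> -n arize -o jsonpath='{.spec.template.spec.containers[*].name}' prints the container names in order.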
After patching MinIO, continue checking the install status. Some releases re-render the MinIO StatefulSet and reintroduce hostPort, so repeat the patch if MinIO drops back to 3/4 and events mention requested pod ports. MinIO is a four-replica distributed StatefulSet, and the cluster cannot serve API requests or create buckets until all four replicas are Ready.
Promtail and PodSecurity baseline
The Arize AX chart deploys a promtail DaemonSet that mounts node hostPath volumes (/var/log/pods, /var/lib/docker/containers, and similar) to ship pod logs to Loki. The PodSecurity baseline policy rejects hostPath volumes, so on Talos, kyverno-enforced clusters, and other restricted-PSA environments no promtail pod can schedule. Look for promtail in kubectl get events -n arize.

If promtail pods are rejected, one fix is to relax PodSecurity on the arize namespace.
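Relaxing PodSecurity for a namespace is normally done with the Pod Security Admission labels. The sketch below assumes the cluster uses the built-in Pod Security Admission controller and that the privileged level is acceptable to your platform team (baseline and restricted both forbid hostPath volumes); the function name is illustrative.

```shell
# Set the Pod Security Admission enforce level on a namespace (default: arize).
# "privileged" is assumed here because the other levels forbid hostPath volumes.
relax_podsecurity() {
  ns="${1:-arize}"
  kubectl label namespace "$ns" \
    pod-security.kubernetes.io/enforce=privileged --overwrite
}
```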
This relaxes PodSecurity for the arize namespace as a whole, not just promtail. If your platform team requires a tighter scope, use a kyverno PolicyException (or your policy engine’s equivalent) targeting only the promtail DaemonSet’s hostPath volumes instead.
Check the install
After arize.sh install finishes, do not rely only on the shell exit code. Confirm the operator and pods are healthy:
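A basic health check could look like the sketch below. It assumes the operator runs in the arize-operator namespace and the application in arize, as the rest of this page indicates; the function name is illustrative.

```shell
# Show operator pods, install jobs, and any application pods whose phase is
# neither Running nor Succeeded.
check_install() {
  kubectl get pods -n arize-operator
  kubectl get jobs -n arize
  kubectl get pods -n arize \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

An empty last listing is a good sign but not proof of health: a pod in CrashLoopBackOff still reports the Running phase, so scan the full pod list as well.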
./arize.sh -y install-status can also be useful when Arize AI Support asks for a deeper status check, but it prints the startup configuration, including secret values such as hubJwt, and it can produce a lot of output. Redact the output before sharing it.
Useful operator statuses:
| Status | Meaning |
|---|---|
| Executing | The operator is rendering and applying manifests. |
| Installing | Core install jobs and dependencies are still starting. |
| Delayed | The operator is waiting for active install jobs to finish before it can reconcile. If a job is stuck Running for an unreasonable time, see Troubleshooting failed or stuck jobs. |
| Running | The operator reports the deployment is complete. |
| Error | Check operator logs and failed jobs. |

A healthy install has the operator running in arize-operator, all install jobs completed in arize, and application pods running.
Fresh reinstall cleanup
This procedure removes an Arize AX install from any Kubernetes cluster, including managed cloud, OpenShift, Rancher, and bare metal, so you can reinstall from scratch. Run from a workstation with kubectl and helm configured for the target cluster:
helm uninstall removes the operator chart’s cluster-scoped resources (such as the Arize AX Prometheus node ClusterRole and ClusterRoleBinding) automatically. The chart applies a helm.sh/resource-policy: keep annotation to both namespaces, so the explicit kubectl delete ns calls above are required.
Some Arize AX pods use a long graceful termination window. Gazette, for example, sets terminationGracePeriodSeconds: 1500 (25 minutes). If namespace deletion stays in Terminating for more than a few minutes, force-delete any stuck pods to unblock it:
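Force-deleting the stuck pods might be scripted as in the sketch below. It assumes the default kubectl get pods column layout, where STATUS is the third column, and it skips graceful shutdown entirely, so it only suits a reset where the data is being discarded anyway. The function name is illustrative.

```shell
# Force-delete every pod that shows Terminating in a namespace (default: arize).
force_delete_terminating_pods() {
  ns="${1:-arize}"
  kubectl get pods -n "$ns" --no-headers |
    awk '$3 == "Terminating" { print $1 }' |
    while read -r pod; do
      # --grace-period=0 --force removes the pod without waiting for shutdown.
      kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
    done
}
```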
Troubleshooting failed or stuck jobs
If the operator reports Error, Delayed, or says bad jobs exist, check jobs and logs:
A Failed job blocks retry. So does a Running job that has been waiting for hours on a dependency that has since recovered: the init pod will not reconnect, and the operator stays Delayed, reconciling around it. In either case, delete the Arize AX-owned job and re-apply. One example is install-minio-init, which loops on Waiting to create local alias... if MinIO was unhealthy at job start time.
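Inspecting and deleting a single job could be sketched as follows. The helpers assume the jobs live in the arize namespace; the function names are illustrative and not part of arize.sh.

```shell
# Show a job's status and the tail of its logs in the arize namespace.
inspect_job() {
  kubectl describe job "$1" -n arize
  kubectl logs "job/$1" -n arize --tail=50
}

# Delete a single job by name so that it can be recreated on re-apply.
delete_job() {
  kubectl delete job "$1" -n arize
}
```

For example, run inspect_job install-minio-init, then delete_job install-minio-init, then re-apply. Re-running ./arize.sh install is assumed here to be the re-apply step.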