# Installation on Rancher (bare metal)

> Deploy self-hosted Arize AX on Kubernetes managed with Rancher on bare metal: cluster prep, distribution install, ingress, and validation.

Use this page when Kubernetes is already running and is managed outside GKE, EKS, AKS, or OpenShift. This includes Rancher-managed clusters, Talos clusters, and other bare-metal or private Kubernetes environments.

Your platform team owns the cluster-level choices: storage classes, ingress, DNS, registry access, and object storage. The Arize AX install flow is still the same: point `kubectl` at the target cluster, create `values.yaml` next to `arize.sh`, then run `./arize.sh install` from the extracted distribution directory.

## Before you start

You need:

* A working kubeconfig for the target cluster.
* Block-storage-backed persistent volumes. Do not use NFS-backed volumes for Arize AX persistent volumes.
* Object storage for Gazette and ArizeDB data. For bare-metal installs this is usually MinIO, Ceph, or another S3-compatible service.
* A storage class name for standard persistent volumes and SSD-style persistent volumes. They can be the same storage class if the cluster does not split storage tiers.
* The application URL you plan to expose Arize AX at, for `appBaseUrl`. DNS does not need to resolve at install time; you can wire DNS and ingress afterwards. Pick the URL you intend to keep, though: OAuth callbacks, application redirects, and the configmap rendered by the operator all reference `appBaseUrl`, so changing it later means re-rendering the install configuration.
* The release version from [On-Premise Releases](https://arize.com/docs/ax/selfhosting/guides/releases), plus distribution access, organization name, and sizing profile from Arize AI.

<Warning>
  Use dedicated namespaces for Arize AX. The examples use `arize` and `arize-operator`; if you choose different names, keep them dedicated to this Arize AX install. Do not install Arize AX into a namespace shared with other applications. Cleanup and reinstall commands can delete namespace resources and are not safe for shared namespaces.
</Warning>

If you are using Rancher or Talos, confirm which kubeconfig cluster entry the installer should use:

```bash theme={null}
kubectl config get-clusters
kubectl config current-context
```

Set `clusterName` to the cluster entry from `kubectl config get-clusters`. Do not assume it is always the same as the current context name.
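If the kubeconfig holds several clusters, the entry used by the current context can be read directly instead of matched by eye. The sketch below relies on `kubectl config view --minify`, which limits the output to the current context. The function name `current_cluster_entry` is only an illustration and is not part of the Arize AX distribution.

```bash
# Print the kubeconfig cluster entry that the current context points at.
# --minify restricts the view to the current context, so clusters[0] is
# the entry that context uses.
current_cluster_entry() {
  kubectl config view --minify -o jsonpath='{.clusters[0].name}'
}

# Usage (run against your own kubeconfig):
#   current_cluster_entry
```

Compare the printed name with the list from `kubectl config get-clusters` before copying it into `clusterName`.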

## Decide whether this is an upgrade or a reset

If Arize AX is already installed, running `./arize.sh install` again is the normal path for a refresh, a redeploy of operator-managed manifests, or an upgrade with your current `values.yaml`.

<Note>
  Use [Fresh reinstall cleanup](#fresh-reinstall-cleanup) only when you intentionally want to discard the existing install and reinstall from an empty target.
</Note>

If a previous install failed partway through, check the operator, jobs, and pods before deciding whether to continue with `./arize.sh install` or reset the target. A reset deletes in-cluster Arize AX data, so reset only when data loss is acceptable or you have a backup and restore plan.

## Choose the storage mode

For most Rancher and bare-metal installs, start with one of these values:

| Environment                             | `cloud` value | Object storage                 |
| --------------------------------------- | ------------- | ------------------------------ |
| Operator-managed MinIO in the cluster   | `minio`       | MinIO deployed with Arize AX   |
| Existing Ceph or S3-compatible endpoint | `ceph`        | External S3-compatible service |

Use `minio` when you want the Arize AX install to manage MinIO in the cluster. Use `ceph` when your platform team already provides an S3-compatible endpoint and credentials.

With `cloud: minio`, the operator's `install-minio-init` job creates `gazetteBucket` and `druidBucket` inside the operator-managed MinIO during the install. Pick names that are unique to your install (they only need to be unique within this MinIO); you do not pre-provision them. With `cloud: ceph` or another S3-compatible endpoint, pre-provision both buckets in your platform's storage and grant the install credentials read/write access.

## Create `values.yaml`

Create `values.yaml` in the extracted distribution directory, next to `arize.sh`.

This minimal example is for a bare-metal cluster using operator-managed MinIO:

```yaml theme={null}
helmNamespace: arize-operator
cloud: "minio"
clusterName: "<cluster-entry-from-kubectl-config-get-clusters>"
organizationName: "<organization-name>"
clusterSizing: "<sizing-profile>"

gazetteBucket: "<gazette-bucket-name>"
druidBucket: "<adb-bucket-name>"

hubJwt: "<base64-encoded-runtime-registry-jwt>"
postgresPassword: "<base64-encoded-postgres-password>"
cipherKey: "<base64-encoded-cipher-key>"
minioPassword: "<base64-encoded-minio-password>"

appBaseUrl: "https://<arize-app-domain>"

storageClassMinioStandard: "<standard-storage-class>"
storageClassMinioSsd: "<ssd-storage-class>"

# Set this to false if the cluster does not have a separate ArizeDB node pool.
historicalNodePoolEnabled: false

# Leave empty to use the Postgres deployed by the operator.
postgresHostEndpoint: ""
```

`hubJwt` is the runtime registry credential that Arize AI provides for pulling images from `ch.hub.arize.com` (or your mirror, if you set `pullRegistry`). It is not the same value as the JWT used for the one-time tarball download; Arize AI provides both, and `arize.sh` does not derive one from the other.

To point at a managed or customer-provided Postgres instance instead, set `postgresHostEndpoint` to its hostname and follow [External Postgres requirements](/ax/selfhosting/installation/external-postgres-requirements) for supported versions, sizing, parameters, and database initialization.

For smaller bare-metal clusters, ask Arize AI which sizing profile to use. Do not use `small1b` or `medium2b` unless the nodes match the sizing requirements in [Cluster sizing](/ax/selfhosting/getting-started/prerequisites#cluster-sizing).
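Because the example uses `<...>` placeholders, a quick scan for leftovers before the install can catch a value that was never filled in. This is a minimal sketch; `check_placeholders` is a hypothetical helper, and it assumes that real values never contain angle brackets.

```bash
# List lines in a values file that still contain a <placeholder>.
# Returns 0 when nothing is left to fill in, 1 otherwise.
check_placeholders() {
  local file="${1:-values.yaml}"
  # Drop YAML comment lines so explanatory comments do not trip the check.
  if grep -n '<[^>]*>' "$file" | grep -v '^[0-9]*:[[:space:]]*#'; then
    echo "Unfilled placeholders remain in $file" >&2
    return 1
  fi
  echo "No placeholders found in $file"
}

# Usage, from the extracted distribution directory:
#   check_placeholders values.yaml
```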

## Small cluster values

Small Rancher, Talos, or homelab clusters often need extra values beyond the minimal example. These are not universal production defaults, but they are common when the cluster has a shared node pool or less memory than the standard sizing profiles expect.

<Note>
  Before using operator-managed MinIO on a small cluster, confirm the storage budget for your release. Some releases create eight 150 Gi PVCs for MinIO alone (two per replica), before the other Arize AX volumes are created. Longhorn or another replicated storage backend may need more physical disk than the PVC total. If that does not fit comfortably, use an external S3-compatible store with `cloud: ceph`, add storage, or work with Arize AI on a smaller storage plan before installing.
</Note>
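The PVC figures in the note translate into a rough disk budget. The sketch below multiplies PVC count, PVC size, and the storage backend's replica count. The replica count of 3 is an assumption based on Longhorn's default setting; substitute the value your backend actually uses, and treat the result as provisioned capacity rather than measured usage.

```bash
# Estimate provisioned disk for a set of PVCs on a replicated backend.
# Arguments: PVC count, PVC size in Gi, backend replica count.
pvc_raw_gi() {
  echo $(( $1 * $2 * $3 ))
}

# Eight 150 Gi MinIO PVCs: the PVC total, then with three backend replicas.
echo "MinIO PVC total: $(pvc_raw_gi 8 150 1) Gi"
echo "Raw disk at 3 replicas: $(pvc_raw_gi 8 150 3) Gi"
```

With these inputs the PVC total is 1200 Gi and the replicated figure is 3600 Gi, before any other Arize AX volumes are counted.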

Discuss these with Arize AI before using them in production:

```yaml theme={null}
# Use a shared node pool instead of a dedicated ArizeDB pool.
historicalNodePoolEnabled: false

# Use node emptyDir for component scratch space instead of large default PVCs.
ephemeralMode: emptyDir

# Disable autosizing on small clusters: with autosizing on, the ArizeDB
# historical container can request 24 GiB of memory, which will not fit on
# a typical homelab node and the pod will stay Pending.
autoSizeMemory: false
autoSizeReplicas: false
```

Some small clusters also need ArizeDB historical JVM/resource tuning, toleration changes, or `baseOverlay` patches. Treat those as environment-specific overrides, not copy/paste defaults.

If your nodes are tainted, make the Arize AX tolerations match the taints your platform uses. The toleration values are strings, so keep the list quoted:

```yaml theme={null}
podTolerationAll: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationBase: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationHist: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
# Only set this if legacy Spark is enabled.
podTolerationSpark: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
```
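To keep the quoting consistent across the toleration values, the string can be generated rather than typed. This sketch reproduces the format shown above for a single taint. `toleration_value` is a hypothetical helper, and the `workload=arize:NoSchedule` taint is only the example from this page.

```bash
# Build the quoted toleration list for one taint (key, value, effect).
toleration_value() {
  printf "\"[{key: '%s', operator: 'Equal', value: '%s', effect: '%s'}]\"\n" "$1" "$2" "$3"
}

# Emit the three common toleration settings for a single taint.
for name in podTolerationAll podTolerationBase podTolerationHist; do
  echo "$name: $(toleration_value workload arize NoSchedule)"
done
```

To see which taints your nodes carry, `kubectl describe nodes | grep Taints` gives a quick overview.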

`baseOverlay` is a multiline YAML patch that the operator applies to Arize AX application manifests. Use it for targeted Arize AI-reviewed changes, such as changing a replica count or a container resource request. Paste it under `baseOverlay: |` exactly as provided:

```yaml theme={null}
baseOverlay: |
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: druid-compaction
    namespace: arize
  spec:
    replicas: 1
```

<Warning>
  Do not put `volumeClaimTemplates` inside a `baseOverlay`. Kubernetes does not allow an existing StatefulSet's `volumeClaimTemplates` to be changed. If this happens, the operator can stop reconciling with a `Forbidden` error. To use different PVC sizes, set the chart value before the first install when one is available. If the StatefulSet already exists, work with Arize AI on a cleanup and recreate plan for that component.
</Warning>

## Encoding values

The secret fields in the example must be base64-encoded. Encode a short value with:

```bash theme={null}
printf '%s' '<value>' | base64 | tr -d '\n'
```

Do not base64-encode `clusterName`, `organizationName`, bucket names, storage class names, registry hostnames, or `appBaseUrl`.
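A small wrapper makes it easy to encode each secret the same way and to confirm that it decodes back to the original. This is a sketch with hypothetical helper names. It uses `printf '%s'` rather than `echo` so that no trailing newline ends up inside the encoded value.

```bash
# Encode a value as single-line base64 without a trailing newline.
b64_encode() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode again to confirm the round trip before pasting into values.yaml.
b64_decode() {
  printf '%s' "$1" | base64 --decode
}

encoded=$(b64_encode 'example-password')
echo "$encoded"
[ "$(b64_decode "$encoded")" = 'example-password' ] && echo "round trip ok"
```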

## Run the install

Run commands from the extracted distribution directory:

```bash theme={null}
./arize.sh -y -t 5400 install
```

Use a longer timeout on smaller clusters and on clusters that pull images from an external registry.

Some distributions install the operator chart without creating the `arize-operator` namespace first, which causes `arize.sh install` to fail with:

```text theme={null}
namespaces "arize-operator" not found
```

<Warning>
  Do not fix this by creating `arize-operator` with a plain `kubectl create namespace` command. Helm needs to own the namespace; a namespace created with plain `kubectl create namespace` lacks the ownership annotations Helm expects, and a later upgrade can fail.
</Warning>

Workaround: in your local copy of the distribution, edit `arize.sh` so that each operator-chart `helm upgrade --install` command passes `--create-namespace`. There are two such commands; search for `arize-op` in `arize.sh` to find them, and add `--create-namespace` directly after `helm upgrade --install` in each. Then re-run `./arize.sh -y -t 5400 install`. Helm creates the namespace with proper ownership metadata, so subsequent upgrades work.
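To locate the commands to edit, and to confirm the edit afterwards, a line-numbered search is usually enough. The sketch below is a hypothetical helper that assumes `arize-op` and the flag appear on the same line. If your copy of `arize.sh` splits the Helm command across lines, inspect the surrounding lines by hand.

```bash
# List lines in the installer script that mention the operator release,
# with line numbers, and mark whether --create-namespace is already there.
find_operator_installs() {
  local script="${1:-arize.sh}"
  grep -n 'arize-op' "$script" | while IFS= read -r line; do
    case "$line" in
      *--create-namespace*) echo "ok      $line" ;;
      *)                    echo "check   $line" ;;
    esac
  done
}

# Usage, from the extracted distribution directory:
#   find_operator_installs arize.sh
```

Lines marked `check` are candidates for the edit; not every match is necessarily a Helm install command.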

## Talos and PodSecurity notes

Talos and other bare-metal clusters often enforce the Kubernetes PodSecurity `baseline` policy, which rejects MinIO pods that bind `hostPort: 9000`. Check events when MinIO is not fully ready:

```bash theme={null}
kubectl get events -n arize --sort-by=.lastTimestamp
kubectl get sts minio -n arize
```

Two event messages usually point to this issue:

* `violates PodSecurity "baseline"` and mentions `hostPort`
* `didn't have free ports for the requested pod ports`

The scheduling message can show up even after PodSecurity is relaxed. With `hostPort: 9000` still set, Kubernetes can place only one MinIO pod per node, so a cluster with fewer than four schedulable nodes (a two-node cluster, for example) cannot run all four MinIO replicas until `hostPort` is removed.

The on-prem MinIO StatefulSet has two containers (`mc` and `minio`), and their order is not guaranteed across releases. Remove `hostPort` from whichever container is named `minio`:

```bash theme={null}
IDX=$(kubectl get sts minio -n arize \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' \
  | grep -n '^minio$' | head -1 | cut -d: -f1)
if [ -z "$IDX" ]; then
  echo "MinIO container not found; contact Arize for an updated patch."
else
  IDX=$((IDX - 1))
  kubectl patch sts -n arize minio --type='json' -p="[
    {\"op\":\"test\",\"path\":\"/spec/template/spec/containers/$IDX/name\",\"value\":\"minio\"},
    {\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$IDX/ports/0/hostPort\"}
  ]"
  kubectl rollout restart statefulset/minio -n arize || true
fi
```

The `test` operation guards against the StatefulSet layout changing in future releases. If the patch errors out, the StatefulSet is unchanged; contact Arize AI for an updated patch.

After patching MinIO, continue checking the install status. Some releases re-render the MinIO StatefulSet and reintroduce `hostPort`, so repeat the patch if MinIO drops back to `3/4` and events mention requested pod ports. MinIO is a four-replica distributed StatefulSet, and the cluster cannot serve API requests or create buckets until all four replicas are `Ready`.
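Since a re-render can bring `hostPort` back, it helps to check the StatefulSet template directly instead of waiting for scheduling events. This is a minimal sketch with hypothetical function names. It assumes the StatefulSet is still named `minio` in the `arize` namespace, as in the commands above.

```bash
# Print any hostPort values still set on the MinIO StatefulSet template.
minio_host_ports() {
  kubectl get sts minio -n arize \
    -o jsonpath='{.spec.template.spec.containers[*].ports[*].hostPort}'
}

# Return 0 (and say so) when the hostPort patch needs to be applied again.
minio_needs_patch() {
  local ports
  ports=$(minio_host_ports)
  if [ -n "$ports" ]; then
    echo "hostPort still set: $ports"
    return 0
  fi
  echo "no hostPort on minio containers"
  return 1
}

# Usage:
#   minio_needs_patch && echo "re-run the patch above"
```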

### Promtail and PodSecurity baseline

The Arize AX chart deploys a `promtail` DaemonSet that mounts node `hostPath` volumes (`/var/log/pods`, `/var/lib/docker/containers`, and similar) to ship pod logs to Loki. The PodSecurity `baseline` policy rejects `hostPath` volumes, so on Talos, Kyverno-enforced clusters, and other restricted-PSA environments the `promtail` pods are rejected at admission and never start. Look for `promtail` in `kubectl get events -n arize`:

```text theme={null}
pods "promtail-xxxxx" is forbidden: violates PodSecurity "baseline:latest":
  hostPath volumes (volumes "run", "containers", "pods")
```

Promtail is not in the install-blocking path. The rest of Arize AX installs and runs without it, but in-cluster log shipping to Loki will be missing until you address the policy. The simplest fix is to relax PSA enforcement for the `arize` namespace:

```bash theme={null}
kubectl label ns arize pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl rollout restart daemonset/promtail -n arize
```

<Note>
  This relaxes PodSecurity for the `arize` namespace as a whole, not just promtail. If your platform team requires a tighter scope, use a Kyverno `PolicyException` (or your policy engine's equivalent) targeting only the promtail DaemonSet's hostPath volumes instead.
</Note>

## Check the install

After `arize.sh install` finishes, do not rely only on the shell exit code. Confirm the operator and pods are healthy:

```bash theme={null}
kubectl get pods -n arize-operator
kubectl get pod arize-op-arize-operator-operator-0 -n arize-operator \
  -o jsonpath='{.metadata.annotations.operator\.arize\.com/status}{"\n"}'
kubectl get pods -n arize
kubectl get jobs -n arize
```

`./arize.sh -y install-status` can also be useful when Arize AI Support asks for a deeper status check, but it prints the startup configuration, including secret values such as `hubJwt`, and it can produce a lot of output. Redact the output before sharing it.

Useful operator statuses:

| Status       | Meaning                                                                                                                                                                                                                   |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Executing`  | The operator is rendering and applying manifests.                                                                                                                                                                         |
| `Installing` | Core install jobs and dependencies are still starting.                                                                                                                                                                    |
| `Delayed`    | The operator is waiting for active install jobs to finish before it can reconcile. If a job is stuck Running for an unreasonable time, see [Troubleshooting failed or stuck jobs](#troubleshooting-failed-or-stuck-jobs). |
| `Running`    | The operator reports the deployment is complete.                                                                                                                                                                          |
| `Error`      | Check operator logs and failed jobs.                                                                                                                                                                                      |

A successful install should leave the Arize AX Operator running in `arize-operator`, all install jobs completed in `arize`, and application pods running.
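On slow clusters the operator can sit in `Executing`, `Installing`, or `Delayed` for a while, so polling the status annotation saves re-running the query by hand. The sketch below wraps the same query used above. The function names, the 30-minute default timeout, and the 15-second interval are illustrative choices, not values from the distribution.

```bash
# Read the operator status annotation (same query as in the commands above).
operator_status() {
  kubectl get pod arize-op-arize-operator-operator-0 -n arize-operator \
    -o jsonpath='{.metadata.annotations.operator\.arize\.com/status}' 2>/dev/null
}

# Poll until the status is Running (0), Error (1), or the timeout passes (2).
# Arguments: timeout in seconds (default 1800), poll interval (default 15).
wait_for_operator() {
  local timeout="${1:-1800}"
  local interval="${2:-15}"
  local waited=0
  local status
  while [ "$waited" -le "$timeout" ]; do
    status=$(operator_status || true)
    echo "operator status: ${status:-unknown}"
    case "$status" in
      Running) return 0 ;;
      Error)   return 1 ;;
    esac
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "timed out after ${timeout}s" >&2
  return 2
}

# Usage:
#   wait_for_operator 3600 30
```

A `Running` result still deserves a look at `kubectl get pods -n arize` and `kubectl get jobs -n arize`, as described above.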

## Fresh reinstall cleanup

This procedure removes an Arize AX install from any Kubernetes cluster, including managed cloud, OpenShift, Rancher, and bare metal, so you can reinstall from scratch.

<Warning>
  Use this only when you are intentionally discarding the previous Arize AX install. The PVC deletion below removes in-cluster Arize AX data. Only run this when data loss is acceptable or you have a backup and restore plan.
</Warning>

Run from a workstation with `kubectl` and `helm` configured for the target cluster:

```bash theme={null}
helm uninstall arize-op -n arize-operator || true

kubectl delete all,cm,secret,pvc,sa,role,rolebinding,cronjob,job \
  -n arize --all --ignore-not-found=true --wait=false || true
kubectl delete all,cm,secret,pvc,sa,role,rolebinding,job \
  -n arize-operator --all --ignore-not-found=true --wait=false || true

kubectl delete ns arize --wait=false || true
kubectl delete ns arize-operator --wait=false || true
kubectl wait --for=delete ns/arize --timeout=300s || true
kubectl wait --for=delete ns/arize-operator --timeout=300s || true
```

`helm uninstall` removes the operator chart's cluster-scoped resources (such as the Arize AX Prometheus node `ClusterRole` and `ClusterRoleBinding`) automatically. The chart applies a `helm.sh/resource-policy: keep` annotation to both namespaces, so the explicit `kubectl delete ns` calls above are required.

Some Arize AX pods use a long graceful termination window. Gazette, for example, sets `terminationGracePeriodSeconds: 1500` (25 minutes). If namespace deletion stays in `Terminating` for more than a few minutes, force-delete any stuck pods to unblock it:

```bash theme={null}
kubectl get pods -n arize -o name \
  | xargs -r kubectl delete -n arize --grace-period=0 --force
```

If a namespace is still stuck after its pods are gone, remove the namespace finalizer. Use this as a last resort: it bypasses controllers that may still be reconciling resources.

```bash theme={null}
kubectl get ns arize -o json \
  | jq 'del(.spec.finalizers)' \
  | kubectl replace --raw /api/v1/namespaces/arize/finalize -f -
```

Before reinstalling, verify both namespaces are gone:

```bash theme={null}
kubectl get ns arize || true
kubectl get ns arize-operator || true
```

Both commands should report `NotFound`. If either namespace still exists, or something has recreated it, verify that it contains no leftover pods or resources before continuing.
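For scripted reinstalls, the namespace check can be turned into a gate that returns non-zero while either namespace remains. This is a sketch with a hypothetical function name. It treats any `kubectl get ns` failure as "gone", so confirm the cluster is reachable first.

```bash
# Return 0 only when none of the given namespaces exist any more.
# Note: a kubectl failure for any reason is counted as "not present".
namespaces_gone() {
  local ns
  local remaining=0
  for ns in "$@"; do
    if kubectl get ns "$ns" >/dev/null 2>&1; then
      echo "still present: $ns"
      remaining=1
    fi
  done
  return "$remaining"
}

# Usage:
#   namespaces_gone arize arize-operator && echo "safe to reinstall"
```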

If the cluster uses Longhorn, also check for old Arize AX volumes before reinstalling. Repeated test installs can leave detached Longhorn volumes even after Kubernetes namespaces and PVCs are gone, and those volumes still consume disk:

```bash theme={null}
kubectl get volumes.longhorn.io -n longhorn-system
```

Delete only volumes that clearly belong to the previous Arize AX install and whose data you intend to discard. Do not delete volumes for other namespaces or other applications.
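To narrow the Longhorn list to likely leftovers, the output can be filtered to detached volumes whose last known PVC was in the Arize AX namespace. This is a sketch only: the column paths in the usage comment (`.status.state`, `.status.kubernetesStatus.namespace`, `.status.kubernetesStatus.pvcName`) are assumptions about the Longhorn `Volume` resource, so check them against your Longhorn version. `detached_in_namespace` is a hypothetical helper.

```bash
# Filter "name state namespace pvc" rows down to detached volumes whose
# last known PVC lived in the given namespace (default: arize).
detached_in_namespace() {
  awk -v ns="${1:-arize}" '$2 == "detached" && $3 == ns { print $1, $4 }'
}

# Usage sketch (column paths are assumptions, see above):
#   kubectl get volumes.longhorn.io -n longhorn-system --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATE:.status.state,NS:.status.kubernetesStatus.namespace,PVC:.status.kubernetesStatus.pvcName \
#     | detached_in_namespace arize
```

The filter only produces a candidate list; review each volume before deleting anything.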

## Troubleshooting failed or stuck jobs

If the operator reports `Error`, `Delayed`, or says bad jobs exist, check jobs and logs:

```bash theme={null}
kubectl get jobs -n arize

# Print the last 120 log lines from each init job's pod, if one exists.
for job in install-postgres-init install-minio-init install-gazette-init install-druid-init; do
  echo "===== $job ====="
  pod=$(kubectl get pod -n arize -l "job-name=$job" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
  if [ -n "$pod" ]; then
    kubectl logs -n arize "$pod" --tail=120
  else
    echo "(no pod found)"
  fi
done
```

A `Failed` job blocks retry. So does a `Running` job that has been waiting for hours on a dependency that has since recovered: the init pod will not reconnect, and the operator stays `Delayed` while it waits for the job. In either case, delete the Arize AX-owned job and re-apply:

```bash theme={null}
kubectl delete job <failed-or-stuck-job-name> -n arize || true
./arize.sh -y apply
```

The operator recreates fresh init jobs on the next reconcile. The most common stuck job is `install-minio-init`, which loops `Waiting to create local alias...` if MinIO was unhealthy at job start time.
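When several jobs exist, a short filter can pick out the ones that have recorded failed pods, which are the first candidates for the delete-and-apply step above. This is a sketch with a hypothetical helper name. A non-zero failed count does not always mean the job has given up, so check its logs before deleting it.

```bash
# From "name failed-count" rows, print jobs with at least one failed pod.
# kubectl prints <none> when .status.failed is unset, so skip those rows.
failed_jobs() {
  awk '$2 != "<none>" && $2 + 0 > 0 { print $1 }'
}

# Usage sketch:
#   kubectl get jobs -n arize --no-headers \
#     -o custom-columns=NAME:.metadata.name,FAILED:.status.failed \
#     | failed_jobs
```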

## Log safety

<Warning>
  `arize.sh` prints secret values such as `hubJwt` in the startup configuration block at the top of every install run. Redact install logs before sharing them with anyone outside your team.
</Warning>
