Self-hosted Arize AX installs on Kubernetes using the distribution archive (Helm operator chart, arize.sh, Terraform, and offline docs/), your values.yaml, and your platform’s networking and storage.
Before you install
- Installation flow — Operator, Helm, and how the pieces connect.
- Prerequisites — sizing, pools, storage, and tools.
- Download and unpack the distribution — obtain and extract the release tarball.
Install by platform
- GCP
- Azure
- AWS
- OpenShift
- Single host
- Rancher (bare metal)
Overview
The diagram below depicts the target topology for GCP.
Throughout this section, distribution archive means the Arize AX release tarball you download with your JWT (Helm chart, arize.sh, Terraform, examples, and offline HTML docs).
Read the ordered steps below once so you know the sequence (download → cluster → install → ingress → validate). Complete Prerequisites, including deployment types for your network model, before you start. Then open each linked page below; you can return here anytime as the hub.
For the full Helm reference, use Advanced → Helm in the HTML documentation under docs/ inside your extracted distribution archive.
Download the distribution
Obtain the distribution archive with your JWT, extract it, and review the Helm chart, scripts, Terraform, and example manifests included in the bundle.
Download and extract
JWT, get_latest.sh, folder layout, and what to open next.
GKE cluster and infrastructure
Bring an existing GKE cluster and supporting GCP resources in line with Arize AX, or use the Terraform modules shipped in the distribution archive to provision them.
Existing cluster
Node pools, labels, GCS buckets, IAM, and Workload Identity.
Terraform
Use bundled modules and README; align static IPs and outputs with ingress.
Install Arize AX
Configure values.yaml for GCP and install the operator chart.
Quick start
Use when: you want a complete example values.yaml and a single Helm command to get running fast.
Detailed walkthrough
Use when: you need every field explained, base64 secrets, optional mirroring, and ./arize.sh versus Helm called out step by step.
Create ingress
Expose the UI and APIs with a Google Cloud HTTP(S) load balancer and the GCP examples, or use NGINX, Istio, Kong, or another controller with the bundled example manifests.
GCP load balancer
Static IPs, examples/endpoints/gcp, and appBaseUrl / expBaseUrl.
NGINX, Istio, Kong, …
Example paths under examples/endpoints and ingressMode guidance.
Validate the deployment
Confirm pods are healthy, ingress and DNS resolve, and the UI is reachable before sending data.
Validation checklist
kubectl health checks, ingress inspection, login, and SDK testing.
Use Arize AX
Open the public Arize AX documentation for product guides and configure your application with the on-prem SDK documentation.
Arize AX documentation
Product documentation, quickstarts, and observability guides on docs.arize.com.
SDK usage (self-hosted)
Python SDK versions and endpoint configuration for self-hosted deployments.
Integrations
Framework and provider integrations for tracing and monitoring.
Overview
The diagram below depicts the target topology for Azure.
Throughout this section, distribution archive means the Arize AX release tarball you download with your JWT (Helm chart, arize.sh, Terraform, examples, and offline HTML docs).
Read the ordered steps below once so you know the sequence (download → cluster → install → ingress → validate). Complete Prerequisites, including deployment types for your network model, before you start. Then open each linked page below; you can return here anytime as the hub.
For the full Helm reference, use Advanced → Helm in the HTML documentation under docs/ inside your extracted distribution archive.
Download the distribution
Obtain the distribution archive with your JWT, extract it, and review the Helm chart, scripts, Terraform, and example manifests included in the bundle.
Download and extract
JWT, get_latest.sh, folder layout, and what to open next.
AKS cluster and infrastructure
Bring an existing AKS cluster and supporting Azure resources in line with Arize AX, or use the Terraform modules shipped in the distribution archive to provision them.
Existing cluster
Node pools, labels, Blob Storage containers, managed identity, and storage access.
Terraform
Use bundled modules and README; align static IPs and outputs with ingress.
Install Arize AX
Configure values.yaml for Azure and install the operator chart.
Quick start
Use when: you want a complete example values.yaml and a single Helm command to get running fast.
Detailed walkthrough
Use when: you need every field explained, base64 secrets, optional mirroring, and ./arize.sh versus Helm called out step by step.
Create ingress
Expose the UI and APIs with an Azure load balancer and the bundled examples, or use NGINX, Istio, Kong, or another controller with the bundled example manifests.
Azure load balancer
Static IPs, examples/endpoints/azure, and appBaseUrl / expBaseUrl.
NGINX, Istio, Kong, …
Example paths under examples/endpoints and ingressMode guidance.
Validate the deployment
Confirm pods are healthy, ingress and DNS resolve, and the UI is reachable before sending data.
Validation checklist
kubectl health checks, ingress inspection, login, and SDK testing.
Use Arize AX
Open the public Arize AX documentation for product guides and configure your application with the on-prem SDK documentation.
Arize AX documentation
Product documentation, quickstarts, and observability guides on docs.arize.com.
SDK usage (self-hosted)
Python SDK versions and endpoint configuration for self-hosted deployments.
Integrations
Framework and provider integrations for tracing and monitoring.
The diagram below depicts the target topology for AWS.
Choose an approach based on the deployment. For Helm, the complete installation walkthrough is under Advanced → Helm in the documentation included with your distribution.
Requirements for AWS:
- Two S3 storage buckets for Gazette and ArizeDB data.
  - Buckets can be configured to use AES256, KMS, or no encryption.
- An EKS cluster with a minimum of two node pools: base pool and ArizeDB pool.
  - The base node pool should be labeled with arize=true and arize-base=true.
  - The ArizeDB node pool should be labeled with arize=true and druid-historical=true.
- Storage class gp2 is preferred and used by default.
- An ECR or docker registry is optional as Arize AX pulls images from Arize AI’s central image registry by default.
- Namespaces arize, arize-operator, and arize-spark can be pre-existing or created later by the helm chart.
- If deployed on a private VPC, these endpoints must be accessible from the cluster:
  - com.amazonaws.<region>.s3
  - com.amazonaws.<region>.ecr.api
  - com.amazonaws.<region>.ecr.dkr
  - com.amazonaws.<region>.ec2
  - com.amazonaws.<region>.elasticloadbalancing
  - com.amazonaws.<region>.sts
  - com.amazonaws.<region>.ebs
- An IAM role with the following policy actions on the Arize AX ArizeDB and Gazette buckets:
  - s3:ListBucket
  - s3:*Object
  - kms:Encrypt
  - kms:Decrypt
  - kms:ReEncrypt*
  - kms:GenerateDataKey*
  - kms:DescribeKey
  - bedrock:InvokeModel
- If using IAM roles for service accounts (IRSA):
  - The roles must have a trust policy that allows these service accounts to assume the role:
    - system:serviceaccount:arize:*
    - system:serviceaccount:arize-spark:*
    - system:serviceaccount:arize-operator:*
- If not using IAM roles for service accounts (IRSA):
  - The policy actions should be added to the role attached to the nodes.
  - Pods should be able to discover the node role through instance metadata.
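For the IRSA case above, a trust policy along the following lines is one way to let service accounts in the three namespaces assume the role. This is a sketch under assumptions, not the policy Arize ships: render_trust_policy is a hypothetical helper, and the account ID and OIDC provider shown are placeholders you would replace with values from your EKS cluster. Because the service account subjects end in a wildcard, the condition uses StringLike.

```shell
# Sketch: render an IRSA trust policy for the three Arize AX namespaces.
# render_trust_policy is a hypothetical helper; both arguments are placeholders.
render_trust_policy() {
  # $1 AWS account ID, $2 OIDC provider (issuer URL without https://)
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$1:oidc-provider/$2"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "$2:sub": [
            "system:serviceaccount:arize:*",
            "system:serviceaccount:arize-spark:*",
            "system:serviceaccount:arize-operator:*"
          ]
        }
      }
    }
  ]
}
EOF
}

# Example values only; they do not refer to a real account or cluster.
render_trust_policy "111122223333" "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
```

The output can be saved to a file and attached as the trust relationship of the role referenced by awsServiceAccountRoleRwBucket.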
Common production values for the <sizing option> field are small1b or medium2b.
values.yaml:
cloud: "aws"
clusterName: "arn:aws:eks:<region>:<account-id>:cluster/<cluster-name>"
hubJwt: "<JWT>" # base64 encoded
gazetteBucket: "<gazette-bucket-name>"
druidBucket: "<adb-bucket-name>"
postgresPassword: "<user selected postgres password>" # base64 encoded
organizationName: "<name of the organization or company>"
cipherKey: "<encryption key>" # base64 encoded
clusterSizing: "<sizing option>"
region: "<region>"
serverSideEncryption: ""
collectNodeMetrics: true
# The URL used to reach the Arize AX UI once ingress endpoints are created
appBaseUrl: "https://<arize-app.domain>"
# Omit this field if using node level roles instead of IAM roles for service accounts (IRSA)
awsServiceAccountRoleRwBucket: "arn:aws:iam::<account-id>:role/<read-write-role>"
# Only required if using a private docker registry
pushRegistry: "<account-id>.dkr.ecr.<region>.amazonaws.com"
pullRegistry: "<account-id>.dkr.ecr.<region>.amazonaws.com"
# Only required if using a common node pool
historicalNodePoolEnabled: false
helm upgrade --install -f values.yaml arize-op arize-operator-chart.tgz
This section provides guidance for installing Arize AX on Red Hat OpenShift, including Red Hat OpenShift Service on AWS (ROSA). The diagram below shows a typical OpenShift private-cloud topology.
Contact Arize for the <sizing option> field. This field controls the deployment size and must align with the size of the cluster. Common production values are small1b and medium2b.
values.yaml:
cloud: "ceph"
clusterName: "<cluster-name>"
hubJwt: "<base64-encoded-runtime-registry-jwt>"
gazetteBucket: "<gazette-bucket-name>"
druidBucket: "<adb-bucket-name>"
postgresPassword: "<base64-encoded-postgres-password>"
organizationName: "<name of the organization or company>"
cipherKey: "<base64-encoded-encryption-key>"
clusterSizing: "<sizing option>"
cephS3Endpoint: "<URL to the Ceph S3 endpoint>"
cephS3AccessKeyId: "<base64-encoded-ceph-s3-access-key-id>"
cephS3SecretAccessKey: "<base64-encoded-ceph-s3-secret-access-key>"
collectNodeMetrics: true
# The URL used to reach the Arize AX UI once ingress endpoints are created
appBaseUrl: "https://<arize-app.domain>"
# Only required if using a private container registry
pushRegistry: "<container-registry>"
pullRegistry: "<container-registry>"
# Only required if using a common node pool
historicalNodePoolEnabled: false
# Change to align with namespaces
baseRunAsUser: 1000
baseRunAsGroup: 1000
baseFsGroup: 1000
postgresRunAsUser: 70
postgresRunAsGroup: 70
postgresFsGroup: 70
operatorRunAsUser: 1000
operatorRunAsGroup: 1000
operatorFsGroup: 1000
ingressMode: 'openshift'
Run the standard installer from the extracted distribution directory:
./arize.sh -y install
If you need to run Helm directly:
helm upgrade --install -f values.yaml arize-op arize-operator-chart.tgz
For the complete installation walkthrough, consult Advanced → Helm in the documentation included with your distribution.
To remove an existing Arize AX install from this cluster before reinstalling, follow Fresh reinstall cleanup.
This process is not for production. Use it only for testing or development.
Prerequisites
Before starting, have the following ready:
- Arize AX distribution access: JWT token for downloading the distribution from Arize AI
- Passwords and secrets: You will choose a MinIO password, Postgres password, and encryption key (all base64-encoded in values.yaml)
- Organization name: Name of your organization or company (for values.yaml)
- App URL: The URL you will use to reach the Arize AX UI (e.g. https://arize-app.yourdomain.com). This can be a hostname you map to a private IP (see Step 8) if the VM has no public address.
- Network access to the VM: If the VM only has a private IP (for example 10.x.x.x, 172.16.x.x, or 192.168.x.x), your browser, SDKs, and any clients must run on a host that can reach that address (same VPC or subnet, site-to-site VPN, or client VPN into the cloud network). You do not need a public IP for this guide.
Step 1: Create the virtual machine
Create a single VM with these specifications:
| Requirement | Specification |
|---|---|
| Size | 16 vCPU, 128 GB RAM — e.g. n2d-highmem-16 (GCP), r7a.4xlarge (AWS), Standard_E16s_v5 (Azure) |
| OS | Debian base image |
| Boot disk | 500 GB |
| Network | Allow HTTP and HTTPS traffic to the VM. If the machine uses a private IP only, restrict ingress to trusted CIDRs (for example your VPC, office, or VPN range) instead of the open internet. If the VM has a public IP, you can allow HTTP/HTTPS from the internet or from specific IPs, depending on your policy. |
| Access | SSH (port 22). Add firewall rules and an SSH key as required by your cloud provider. For a private-IP-only host, allow SSH from your bastion, VPN, or admin network. |
Step 2: Install k3s
SSH into the machine, then run:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig-mode=644" sh - && \
mkdir -p ~/.kube && \
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && \
sudo chown $USER ~/.kube/config && \
chmod 600 ~/.kube/config
Verify the install with kubectl get nodes. You should see your node with status Ready and role control-plane, for example:
NAME STATUS ROLES AGE VERSION
<your machine name> Ready control-plane 176m v1.34.5+k3s1
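The readiness check above can be scripted. The following is a minimal sketch, assuming the default kubectl get nodes table layout with STATUS in the second column; all_nodes_ready is a hypothetical helper name, and the sample row is illustrative only.

```shell
# Sketch: succeed only when every node row in `kubectl get nodes` output is Ready.
# Reads the table from stdin so it can be tried on captured output.
all_nodes_ready() {
  awk 'NR > 1 { seen = 1; if ($2 != "Ready") bad = 1 }
       END { exit ((seen && !bad) ? 0 : 1) }'
}

# Demonstration on a canned table (not real cluster output):
printf '%s\n' \
  'NAME STATUS ROLES AGE VERSION' \
  'node1 Ready control-plane 176m v1.34.5+k3s1' \
  | all_nodes_ready && echo "all nodes Ready"

# On the VM itself:
#   kubectl get nodes | all_nodes_ready && echo "k3s node is Ready"
```

The helper also fails when the table has no node rows, so an empty or errored kubectl call is not mistaken for a healthy cluster.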
Step 3: Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Verify the install with helm version. You should see output containing BuildInfo for your Helm installation.
Step 4: Install MinIO (object storage)
MinIO provides the S3-compatible object storage used by Arize AX. The commands below create a minio directory, a values file, and install the MinIO Helm chart.
- Persistence size: Set persistence.size to the max storage per bucket (e.g. up to 75Gi). With a 500 GB boot disk, 75Gi per bucket leaves space for other PVCs.
- Credentials: Set rootPassword to a password you choose; you will use the same user/password in values.yaml later.
mkdir minio && cd minio
cat > minio-values.yaml << 'EOF'
rootUser: minio
rootPassword: <your chosen password>
persistence:
  size: <up to 75Gi>
  storageClass: local-path
replicas: 1
mode: standalone
buckets:
  - name: gazette-bucket
    policy: none
    purge: false
    versioning: true
    objectlocking: false
  - name: adb-bucket
    policy: none
    purge: false
    versioning: true
    objectlocking: false
EOF
helm repo add minio https://charts.min.io/
helm install minio minio/minio -f minio-values.yaml
Step 5: Retrieve the Arize AX distribution
Create a release folder, download the distribution, and extract it. Replace <your JWT token> with your actual JWT.
VERSION=${1:-$(curl -s https://arize.com/docs/ax/selfhosting/on-premise-releases | grep -Eo 'Release [0-9]+\.[0-9]+\.[0-9]+' | head -1 | awk '{print $2}')}
cd ../ && mkdir arize-release-$VERSION && cd arize-release-$VERSION
URL=https://ch.hub.arize.com/dist
JWT="<your JWT token>"
curl -H "Authorization: Bearer $JWT" "$URL/distributions/arize-distribution-$VERSION.tar" --output "arize-distribution-$VERSION.tar"
tar xvf *.tar
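The default VERSION above is scraped from the releases page by taking the first "Release X.Y.Z" match. The sketch below runs the same grep/head/awk pipeline on canned text so the parsing can be checked offline. latest_release is a hypothetical helper name and the version numbers are made up for illustration.

```shell
# Sketch: the same parsing pipeline as in Step 5, wrapped in a function.
# latest_release is a hypothetical helper; it reads page text from stdin.
latest_release() {
  grep -Eo 'Release [0-9]+\.[0-9]+\.[0-9]+' | head -1 | awk '{print $2}'
}

# Canned page text for illustration only; these are not real release numbers.
VERSION=$(printf '%s\n' '<h2>Release 9.9.1</h2>' '<h2>Release 9.9.0</h2>' | latest_release)

# Guard against an empty result before building file names or URLs from it.
if [ -n "$VERSION" ]; then
  echo "arize-distribution-$VERSION.tar"
else
  echo "could not determine release version; set VERSION explicitly" >&2
fi
```

If the page layout changes and nothing matches, VERSION is empty, so checking it before the download avoids requesting a malformed file name.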
Step 6: Create values.yaml
From the arize-release-* directory, create values.yaml by editing the placeholders below and pasting the result into your terminal. All values marked as base64 encoded must be base64-encoded.
- hubJwt: Your Arize AX JWT (base64)
- postgresPassword: A password you choose (base64)
- cephS3AccessKeyId: MinIO user from Step 4, e.g. minio (base64)
- cephS3SecretAccessKey: MinIO password from Step 4 (base64)
- cipherKey: An encryption key you generate (base64)
- appBaseUrl: The URL where you will access the Arize AX UI
cat > values.yaml << 'EOF'
cloud: "ceph"
clusterName: "default"
hubJwt: "<JWT>" # base64 encoded
gazetteBucket: "gazette-bucket"
druidBucket: "adb-bucket"
postgresPassword: "<user selected postgres password>" # base64 encoded
organizationName: "<name of the organization or company>"
clusterSizing: "nonha"
cephS3Endpoint: "http://minio.default.svc.cluster.local:9000"
cephS3AccessKeyId: "<Minio user set in the previous step>" # base64 encoded
cephS3SecretAccessKey: "<Minio password set in the previous step>" # base64 encoded
cipherKey: "<encryption key>" # base64 encoded
storageClassCephSsd: "local-path"
storageClassCephStandard: "local-path"
ingressMode: "notls"
# The URL used to reach the Arize AX UI once ingress endpoints are created
# Use the same hostname you will put in /etc/hosts (Step 8), e.g. https://arize-app.example.local
# Private-IP only: the hostname must resolve (via hosts file or DNS) to the VM's private IP for clients on that network
appBaseUrl: "https://<arize-app.domain>"
# Only required if using a private docker registry
pushRegistry: "<docker-registry>"
pullRegistry: "<docker-registry>"
baseOverlay: |
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: internalendpoints-app
    namespace: arize
    annotations:
      traefik.ingress.kubernetes.io/service.serversscheme: h2c
EOF
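The base64 fields in Step 6 can be produced on the command line. The following is a minimal sketch, assuming a standard base64 utility that accepts --decode; encode_secret is a hypothetical helper name and the password is a placeholder, not a recommended value.

```shell
# Sketch: encode a secret for values.yaml and confirm it decodes back unchanged.
# encode_secret is a hypothetical helper name.
encode_secret() {
  # printf avoids the trailing newline that echo would add to the secret;
  # tr strips the line wrapping some base64 builds insert on long values.
  printf '%s' "$1" | base64 | tr -d '\n'
}

encoded=$(encode_secret 'example-password')   # placeholder secret
decoded=$(printf '%s' "$encoded" | base64 --decode)

# Print the encoded value only if the round trip matches.
[ "$decoded" = 'example-password' ] && echo "$encoded"
```

Encoding with a trailing newline is a common cause of credentials that look right in values.yaml but fail at runtime, which is why the helper uses printf rather than echo.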
Step 7: Install Arize AX
From the same arize-release-* directory, run:
./arize.sh install
When the install finishes, run kubectl get pods -n arize and confirm that all pods are in a running state.
Step 8: Configure ingress
Ingress exposes the Arize AX UI over HTTPS. You can use any certificate; this step uses a self-signed certificate for a low-effort setup. You can choose any domain (e.g. arize-app.example.local). No real DNS record is required because you will use /etc/hosts to point the hostname to your VM.
Private IP only: If the VM has no public IP, use the VM’s private address in /etc/hosts on every client that should open the UI or send data (your laptop on VPN, a jump host, or a build agent in the same VPC). The certificate and ingress hostnames stay the same; only the IP you map must be reachable from that client.
8a. Generate a self-signed certificate
From the home directory, generate the cert. Replace <your domain> with the domain you want to use (e.g. example.local).
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-out tls.crt \
-keyout tls.key \
-subj "/CN=arize-app.<your domain>/O=Arize" -addext "subjectAltName = DNS:arize-app.<your domain>"
8b. Get base64 values for the ingress manifests
You will paste these values into the ingress YAML in the next step. Run:
base64 -w 0 tls.crt # use this for tls.crt in the manifest
base64 -w 0 tls.key # use this for tls.key in the manifest
8c. Create and apply the ingress manifests
Create an ingress directory and an ingress.yaml file. In the manifest below, replace:
- <your domain>: The same domain you used in the certificate (e.g. example.local)
- <your value for tls.crt>: The full output of base64 -w 0 tls.crt
- <your value for tls.key>: The full output of base64 -w 0 tls.key
mkdir -p ingress && cd ingress
cat > ingress.yaml << 'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: arize-app-services-tls
  namespace: arize
type: kubernetes.io/tls
data:
  tls.crt: "<your value for tls.crt>"
  tls.key: "<your value for tls.key>"
---
apiVersion: v1
kind: Secret
metadata:
  name: default-ingress-cert
  namespace: kube-system
type: kubernetes.io/tls
data:
  tls.crt: "<your value for tls.crt>"
  tls.key: "<your value for tls.key>"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: arize-app-services
  namespace: arize
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - arize-app.<your domain>
      secretName: arize-app-services-tls
  rules:
    - host: arize-app.<your domain>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: internalendpoints-app
                port:
                  number: 443
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: arize-app
  namespace: arize
spec:
  entryPoints:
    - websecure
  tls:
    secretName: arize-app-services-tls
  routes:
    - match: Host(`arize-app.<your domain>`)
      kind: Rule
      services:
        - name: internalendpoints-app
          port: 443
---
apiVersion: traefik.io/v1alpha1
kind: TLSStore
metadata:
  name: default
  namespace: kube-system
spec:
  defaultCertificate:
    secretName: default-ingress-cert
EOF
kubectl apply -f ingress.yaml
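Pasting long base64 strings by hand is easy to get wrong, so the placeholder replacement in 8c can also be scripted. The sketch below is one possible approach, assuming the manifest still contains the placeholder strings exactly as written above; fill_ingress and the output file name are hypothetical.

```shell
# Sketch: substitute the three manifest placeholders with sed.
# fill_ingress is a hypothetical helper; it reads the template on stdin.
fill_ingress() {
  # $1 domain, $2 base64 certificate, $3 base64 key
  # '|' is used as the sed delimiter because base64 output never contains it.
  sed -e "s|<your domain>|$1|g" \
      -e "s|<your value for tls.crt>|$2|g" \
      -e "s|<your value for tls.key>|$3|g"
}

# Demonstration on a single line of the manifest with dummy values:
printf '%s\n' 'host: arize-app.<your domain>' | fill_ingress example.local Q1JU S0VZ

# Possible real use (adjust paths to where tls.crt and tls.key were generated):
#   fill_ingress example.local "$(base64 -w 0 ../tls.crt)" "$(base64 -w 0 ../tls.key)" \
#     < ingress.yaml > ingress.filled.yaml
#   kubectl apply -f ingress.filled.yaml
```

Writing to a second file keeps the template intact, so the substitution can be repeated when the certificate is regenerated.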
8d. Point the hostname to your VM and open the UI
On each machine where you want to use the Arize AX UI (your laptop or another host), add a line to /etc/hosts so arize-app.<your domain> resolves to the VM’s IP. This replaces a DNS record for testing.
- Edit hosts (e.g. sudo vi /etc/hosts or sudo nano /etc/hosts).
- Add a line:
<VM IP address> arize-app.<your domain>
Use the same domain as in the certificate and ingress.
- Public IP: If your cloud VM has a public address, use that IP in /etc/hosts from any client that can reach it (subject to your security group or firewall rules).
- Private IP only: If the VM has only a private IP, use that private address. The client must be on a network that can route to it (for example same VPC, peered network, or connected VPN). If you reach the VM only via SSH through a bastion, you still need a path for HTTPS (port 443) from the browser/SDK machine to the Arize AX node: either run the browser on a host inside the VPC, use VPN, or forward ports with ssh -L and point the hostname at localhost in /etc/hosts to match your tunnel setup.
# Existing entries (leave as-is)
127.0.0.1 localhost
::1 localhost
# Arize AX single-host VM (public or private IP — use the address your client can reach)
<ip address of your VM> arize-app.<your domain>
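When several clients need the mapping, the hosts entry can be added with a small script instead of an editor. This is a sketch only: add_hosts_entry is a hypothetical helper, the IP address and hostname are example values, and the demonstration writes to a scratch file rather than /etc/hosts.

```shell
# Sketch: append a hosts entry only if the hostname is not already mapped.
# add_hosts_entry is a hypothetical helper name.
add_hosts_entry() {
  # $1 hosts file, $2 IP address, $3 hostname
  if grep -qE "[[:space:]]$3([[:space:]]|\$)" "$1"; then
    echo "$3 already present in $1"
  else
    printf '%s %s\n' "$2" "$3" >> "$1"
  fi
}

# Dry run against a scratch copy so nothing on the system changes.
scratch=$(mktemp)
printf '127.0.0.1 localhost\n' > "$scratch"
add_hosts_entry "$scratch" 10.0.0.5 arize-app.example.local   # appends the entry
add_hosts_entry "$scratch" 10.0.0.5 arize-app.example.local   # reports it is already present
cat "$scratch"
rm -f "$scratch"
```

For real use, pass /etc/hosts as the first argument and run the function as root, since the file is not writable by normal users.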
Open https://arize-app.<your domain> in a browser. You should see the Arize AX login page. Accept the self-signed certificate warning if prompted, then sign in with your initial admin credentials. You can now use Arize AX.
Step 9: Validate deployment
The distribution includes example scripts under examples/sdk. Use them to confirm the cluster can receive traces.
- Certificate: The deployment uses a self-signed cert. Have the certificate (e.g. tls.crt) available on the machine where you run the script, or configure the script to skip TLS verification if it supports that.
- Network: Run the script from a host that can reach the VM’s address for HTTPS (public IP or private IP). If the VM uses only a private IP, run the SDK from a host on the same VPC/VPN or with routing to that host, and use the same hostname in /etc/hosts (or DNS) as in appBaseUrl.
From the arize-release-* directory (or wherever you extracted the distribution), run the HTTP trace sample:
cd examples/sdk
# Use the script’s options to point at your Arize AX endpoint and cert if required
python trace_sample_http.py
Use this page when Kubernetes is already running and is managed outside GKE, EKS, AKS, or OpenShift. This includes Rancher-managed clusters, Talos clusters, and other bare-metal or private Kubernetes environments.Your platform team owns the cluster-level choices: storage classes, ingress, DNS, registry access, and object storage. The Arize AX install flow is still the same: point If you are using Rancher or Talos, confirm which kubeconfig cluster entry the installer should use:Set If a previous install failed partway through, check the operator, jobs, and pods before deciding whether to continue with
Use Create
Create Discuss these with Arize AI before using them in production:Some small clusters also need ArizeDB historical JVM/resource tuning, toleration changes, or Do not base64 encode Use a longer timeout on smaller clusters and on clusters that pull images from an external registry.Some distributions install the operator chart without creating the Workaround: in your local copy of the distribution, edit Two event messages usually point to this issue:The Promtail is not in the install-blocking path. The rest of Arize AX installs and runs without it, but in-cluster log shipping to Loki will be missing until you address the policy. The simplest fix is to relax PSA enforcement for the
A successful install should leave the Arize AX Operator running in Run from a workstation with If a namespace is still stuck after its pods are gone, remove the namespace finalizer. Use this as a last resort: it bypasses controllers that may still be reconciling resources.Before reinstalling, verify both namespaces are gone:Verify no leftover pods or resources exist in the recreated namespace before continuing.If the cluster uses Longhorn, also check for old Arize AX volumes before reinstalling. Repeated test installs can leave detached Longhorn volumes even after Kubernetes namespaces and PVCs are gone, and those volumes still consume disk:Delete only volumes that clearly belong to the previous Arize AX install and whose data you intend to discard. Do not delete volumes for other namespaces or other applications.A The operator recreates fresh init jobs on the next reconcile. The most common stuck job is
kubectl at the target cluster, create values.yaml next to arize.sh, then run ./arize.sh install from the extracted distribution directory.Before you start
You need:- A working kubeconfig for the target cluster.
- Block-storage-backed persistent volumes. Do not use NFS-backed volumes for Arize AX persistent volumes.
- Object storage for Gazette and ArizeDB data. For bare-metal installs this is usually MinIO, Ceph, or another S3-compatible service.
- A storage class name for standard persistent volumes and SSD-style persistent volumes. They can be the same storage class if the cluster does not split storage tiers.
- The application URL you plan to expose Arize AX at, for
appBaseUrl. DNS does not need to resolve at install time; you can wire DNS and ingress afterwards. Pick the URL you intend to keep, though: OAuth callbacks, application redirects, and the configmap rendered by the operator all referenceappBaseUrl, so changing it later means re-rendering the install configuration. - The release version from On-Premise Releases, plus distribution access, organization name, and sizing profile from Arize AI.
Use dedicated namespaces for Arize AX. The examples use
arize and arize-operator; if you choose different names, keep them dedicated to this Arize AX install. Do not install Arize AX into a namespace shared with other applications. Cleanup and reinstall commands can delete namespace resources and are not safe for shared namespaces.kubectl config get-clusters
kubectl config current-context
clusterName to the cluster entry from kubectl config get-clusters. Do not assume it is always the same as the current context name.Decide whether this is an upgrade or a reset
If Arize AX is already installed, running./arize.sh install again is the normal path for a refresh, a redeploy of operator-managed manifests, or an upgrade with your current values.yaml.Use Fresh reinstall cleanup only when you intentionally want to discard the existing install and reinstall from an empty target.
./arize.sh install or reset the target. A reset deletes in-cluster Arize AX data unless you have a backup/restore plan.Choose the storage mode
For most Rancher and bare-metal installs, start with one of these values:| Environment | cloud value | Object storage |
|---|---|---|
| Operator-managed MinIO in the cluster | minio | MinIO deployed with Arize AX |
| Existing Ceph or S3-compatible endpoint | ceph | External S3-compatible service |
minio when you want the Arize AX install to manage MinIO in the cluster. Use ceph when your platform team already provides an S3-compatible endpoint and credentials.With cloud: minio, the operator’s install-minio-init job creates gazetteBucket and druidBucket inside the operator-managed MinIO during the install. Pick names that are unique to your install (they only need to be unique within this MinIO); you do not pre-provision them. With cloud: ceph or another S3-compatible endpoint, pre-provision both buckets in your platform’s storage and grant the install credentials read/write access.Create values.yaml
Create values.yaml in the extracted distribution directory, next to arize.sh.This minimal example is for a bare-metal cluster using operator-managed MinIO:helmNamespace: arize-operator
cloud: "minio"
clusterName: "<cluster-entry-from-kubectl-config-get-clusters>"
organizationName: "<organization-name>"
clusterSizing: "<sizing-profile>"
gazetteBucket: "<gazette-bucket-name>"
druidBucket: "<adb-bucket-name>"
hubJwt: "<base64-encoded-runtime-registry-jwt>"
postgresPassword: "<base64-encoded-postgres-password>"
cipherKey: "<base64-encoded-cipher-key>"
minioPassword: "<base64-encoded-minio-password>"
appBaseUrl: "https://<arize-app-domain>"
storageClassMinioStandard: "<standard-storage-class>"
storageClassMinioSsd: "<ssd-storage-class>"
# Set this to false if the cluster does not have a separate ArizeDB node pool.
historicalNodePoolEnabled: false
# Leave empty to use the Postgres deployed by the operator.
postgresHostEndpoint: ""
hubJwt is the runtime registry credential that Arize AI provides for pulling images from ch.hub.arize.com (or your mirror, if you set pullRegistry). It is not the same value as the JWT used for the one-time tarball download; Arize AI provides both, and arize.sh does not derive one from the other.To point at a managed or customer-provided Postgres instance instead, set postgresHostEndpoint to its hostname and follow External Postgres requirements for supported versions, sizing, parameters, and database initialization.For smaller bare-metal clusters, ask Arize AI which sizing profile to use. Do not use small1b or medium2b unless the nodes match the sizing requirements in Cluster sizing.Small cluster values
Small Rancher, Talos, or homelab clusters often need extra values beyond the minimal example. These are not universal production defaults, but they are common when the cluster has a shared node pool or less memory than the standard sizing profiles expect.Before using operator-managed MinIO on a small cluster, confirm the storage budget for your release. Some releases create eight 150 Gi PVCs for MinIO alone (two per replica), before the other Arize AX volumes are created. Longhorn or another replicated storage backend may need more physical disk than the PVC total. If that does not fit comfortably, use an external S3-compatible store with
cloud: ceph, add storage, or work with Arize AI on a smaller storage plan before installing.# Use a shared node pool instead of a dedicated ArizeDB pool.
historicalNodePoolEnabled: false
# Use node emptyDir for component scratch space instead of large default PVCs.
ephemeralMode: emptyDir
# Disable autosizing on small clusters: with autosizing on, the ArizeDB
# historical container can request 24 GiB of memory, which will not fit on
# a typical homelab node and the pod will stay Pending.
autoSizeMemory: false
autoSizeReplicas: false
baseOverlay patches. Treat those as environment-specific overrides, not copy/paste defaults.If your nodes are tainted, make the Arize AX tolerations match the taints your platform uses. The toleration values are strings, so keep the list quoted:podTolerationAll: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationBase: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
podTolerationHist: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
# Only set this if legacy Spark is enabled.
podTolerationSpark: "[{key: 'workload', operator: 'Equal', value: 'arize', effect: 'NoSchedule'}]"
baseOverlay is a multiline YAML patch that the operator applies to Arize AX application manifests. Use it for targeted Arize AI-reviewed changes, such as changing a replica count or a container resource request. Paste it under baseOverlay: | exactly as provided:baseOverlay: |
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: druid-compaction
namespace: arize
spec:
replicas: 1
Do not put
volumeClaimTemplates inside a baseOverlay. Kubernetes does not allow an existing StatefulSet’s volumeClaimTemplates to be changed. If this happens, the operator can stop reconciling with a Forbidden error. To use different PVC sizes, set the chart value before the first install when one is available. If the StatefulSet already exists, work with Arize AI on a cleanup and recreate plan for that component.Encoding values
The secret fields in the example must be base64-encoded. Encode a short value with:printf '%s' '<value>' | base64 | tr -d '\n'
clusterName, organizationName, bucket names, storage class names, registry hostnames, or appBaseUrl.Run the install
Run commands from the extracted distribution directory:./arize.sh -y -t 5400 install
If the install does not create the arize-operator namespace first, arize.sh install fails with:
namespaces "arize-operator" not found
Do not fix this by creating arize-operator with a plain kubectl create namespace command. Helm needs to own the namespace; a namespace created that way lacks the ownership annotations Helm expects, and a later upgrade can fail.
Instead, edit arize.sh so the operator-chart helm upgrade --install command passes --create-namespace. The two lines to edit are the operator-chart installs (search for arize-op in arize.sh); add --create-namespace after helm upgrade --install. Then re-run ./arize.sh -y -t 5400 install. Helm creates the namespace with proper ownership metadata, so subsequent upgrades work.
Talos and PodSecurity notes
Talos and other bare-metal clusters often enforce the Kubernetes PodSecurity baseline policy, which rejects MinIO pods that bind hostPort: 9000. Check events when MinIO is not fully ready:
kubectl get events -n arize --sort-by=.lastTimestamp
kubectl get sts minio -n arize
Look for either of these messages:
- violates PodSecurity "baseline" and mentions hostPort
- didn't have free ports for the requested pod ports
With hostPort: 9000 still set, Kubernetes can place only one MinIO pod per node. A two-node cluster cannot run all four MinIO replicas until hostPort is removed.
The on-prem MinIO StatefulSet has two containers (mc and minio), and their order is not guaranteed across releases. Remove hostPort from whichever container is named minio:
# Find the 1-based position of the container named "minio".
IDX=$(kubectl get sts minio -n arize \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' \
  | grep -n '^minio$' | head -1 | cut -d: -f1)
if [ -z "$IDX" ]; then
  echo "MinIO container not found; contact Arize for an updated patch."
else
  # JSON Patch paths are zero-based.
  IDX=$((IDX - 1))
  kubectl patch sts -n arize minio --type='json' -p="[
    {\"op\":\"test\",\"path\":\"/spec/template/spec/containers/$IDX/name\",\"value\":\"minio\"},
    {\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$IDX/ports/0/hostPort\"}
  ]"
  kubectl rollout restart statefulset/minio -n arize || true
fi
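The index lookup in the script above can be exercised without a cluster. The sketch below isolates that step as a function that reads container names on stdin, one per line, and prints the zero-based index that a JSON Patch path needs. The name container_index is hypothetical and the function is an illustration of the logic, not a replacement for the patch.

```shell
# Hypothetical helper (not part of arize.sh): print the zero-based position of
# the container named $1 in a newline-separated list read from stdin.
# Returns a non-zero status if the name is not present.
container_index() {
  awk -v name="$1" '
    $0 == name { print NR - 1; found = 1; exit }
    END { exit !found }
  '
}
```

Feeding it the output of the jsonpath query from the script, for example printf 'mc\nminio\n' | container_index minio, shows which index the patch would target for that container order.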
The test operation guards against the StatefulSet layout changing in future releases. If the patch errors out, the StatefulSet is unchanged; contact Arize AI for an updated patch.
After patching MinIO, continue checking the install status. Some releases re-render the MinIO StatefulSet and reintroduce hostPort, so repeat the patch if MinIO drops back to 3/4 ready replicas and events mention requested pod ports. MinIO is a four-replica distributed StatefulSet, and the cluster cannot serve API requests or create buckets until all four replicas are Ready.
Promtail and PodSecurity baseline
The Arize AX chart deploys a promtail DaemonSet that mounts node hostPath volumes (/var/log/pods, /var/lib/docker/containers, and similar) to ship pod logs to Loki. The PodSecurity baseline policy rejects hostPath volumes, so on Talos, Kyverno-enforced clusters, and other restricted-PSA environments no promtail pod can be created. Look for promtail in kubectl get events -n arize:
pods "promtail-xxxxx" is forbidden: violates PodSecurity "baseline:latest":
hostPath volumes (volumes "run", "containers", "pods")
To let promtail run, set PodSecurity enforcement to privileged on the arize namespace:
kubectl label ns arize pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl rollout restart daemonset/promtail -n arize
This relaxes PodSecurity for the arize namespace as a whole, not just promtail. If your platform team requires a tighter scope, use a Kyverno PolicyException (or your policy engine’s equivalent) targeting only the promtail DaemonSet’s hostPath volumes instead.
Check the install
After arize.sh install finishes, do not rely only on the shell exit code. Confirm the operator and pods are healthy:
kubectl get pods -n arize-operator
kubectl get pod arize-op-arize-operator-operator-0 -n arize-operator \
-o jsonpath='{.metadata.annotations.operator\.arize\.com/status}{"\n"}'
kubectl get pods -n arize
kubectl get jobs -n arize
./arize.sh -y install-status can also be useful when Arize AI Support asks for a deeper status check, but it prints the startup configuration, including secret values such as hubJwt, and it can produce a lot of output. Redact the output before sharing it.
Useful operator statuses:
| Status | Meaning |
|---|---|
| Executing | The operator is rendering and applying manifests. |
| Installing | Core install jobs and dependencies are still starting. |
| Delayed | The operator is waiting for active install jobs to finish before it can reconcile. If a job is stuck Running for an unreasonable time, see Troubleshooting failed or stuck jobs. |
| Running | The operator reports the deployment is complete. |
| Error | Check operator logs and failed jobs. |
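The statuses in the table above can be turned into a simple decision helper for scripts that poll the operator annotation. This is a sketch only: the function name next_step and the action labels it prints are hypothetical, and the mapping just restates the table.

```shell
# Hypothetical helper (not part of arize.sh): map an operator status string
# to a suggested next step, following the status table above.
next_step() {
  case "$1" in
    Executing|Installing) echo "wait" ;;        # install still in progress
    Delayed)              echo "check-jobs" ;;  # look for stuck or failed jobs
    Running)              echo "done" ;;        # operator reports completion
    Error)                echo "check-logs" ;;  # inspect operator logs and jobs
    *)                    echo "unknown" ;;     # status not listed in the table
  esac
}
```

The status string itself comes from the operator.arize.com/status annotation read by the kubectl get pod command earlier in this section.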
A healthy install has the operator pod running in arize-operator, all install jobs completed in arize, and application pods running.
Fresh reinstall cleanup
This procedure removes an Arize AX install from any Kubernetes cluster, including managed cloud, OpenShift, Rancher, and bare metal, so you can reinstall from scratch.
Use this only when you are intentionally discarding the previous Arize AX install. The PVC deletion below removes in-cluster Arize AX data. Only run this when data loss is acceptable or you have a backup and restore plan.
Run the following with kubectl and helm configured for the target cluster:
helm uninstall arize-op -n arize-operator || true
kubectl delete all,cm,secret,pvc,sa,role,rolebinding,cronjob,job \
-n arize --all --ignore-not-found=true --wait=false || true
kubectl delete all,cm,secret,pvc,sa,role,rolebinding,job \
-n arize-operator --all --ignore-not-found=true --wait=false || true
kubectl delete ns arize --wait=false || true
kubectl delete ns arize-operator --wait=false || true
kubectl wait --for=delete ns/arize --timeout=300s || true
kubectl wait --for=delete ns/arize-operator --timeout=300s || true
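Since the commands above delete PVCs and cannot be undone, a script that wraps them may benefit from an explicit confirmation step. The following is a minimal sketch of such a guard; the function name confirm_namespace is hypothetical and nothing like it is known to ship with arize.sh.

```shell
# Hypothetical guard (not part of arize.sh): require the operator to type the
# namespace name before a destructive cleanup proceeds.
# $1 = namespace about to be deleted; the answer is read from stdin.
confirm_namespace() {
  printf 'Type "%s" to confirm deleting all data in that namespace: ' "$1" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$1" ]
}
```

A wrapper script could call confirm_namespace arize || exit 1 before running the uninstall and delete commands.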
helm uninstall removes the operator chart’s cluster-scoped resources (such as the Arize AX Prometheus node ClusterRole and ClusterRoleBinding) automatically. The chart applies a helm.sh/resource-policy: keep annotation to both namespaces, so the explicit kubectl delete ns calls above are required.
Some Arize AX pods use a long graceful termination window. Gazette, for example, sets terminationGracePeriodSeconds: 1500 (25 minutes). If namespace deletion stays in Terminating for more than a few minutes, force-delete any stuck pods to unblock it:
kubectl get pods -n arize -o name \
  | xargs -r kubectl delete -n arize --grace-period=0 --force
# If the namespace is still Terminating afterwards, clear its finalizers.
kubectl get ns arize -o json \
  | jq 'del(.spec.finalizers)' \
  | kubectl replace --raw /api/v1/namespaces/arize/finalize -f -
# Confirm that both namespaces are gone.
kubectl get ns arize || true
kubectl get ns arize-operator || true
# On clusters that use Longhorn storage, check for volumes left behind.
kubectl get volumes.longhorn.io -n longhorn-system
Troubleshooting failed or stuck jobs
If the operator reports Error, Delayed, or says bad jobs exist, check jobs and logs:
kubectl get jobs -n arize
for job in install-postgres-init install-minio-init install-gazette-init install-druid-init; do
  echo "===== $job ====="
  pod=$(kubectl get pod -n arize -l "job-name=$job" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
  [ -n "$pod" ] && kubectl logs -n arize "$pod" --tail=120
done
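To narrow the list to jobs that have not finished, the job status can be printed in two columns and filtered. The sketch below assumes input produced by kubectl get jobs -n arize --no-headers -o custom-columns=NAME:.metadata.name,SUCCEEDED:.status.succeeded, where kubectl prints <none> for a job with no successful completion yet. The function name jobs_not_succeeded is hypothetical.

```shell
# Hypothetical filter (not part of arize.sh): read "NAME SUCCEEDED" rows on
# stdin and print the names of jobs with no successful completion recorded.
jobs_not_succeeded() {
  awk '$2 == "<none>" || $2 == "0" { print $1 }'
}
```

Piping the kubectl command above into jobs_not_succeeded lists the jobs worth inspecting. A job in that list may still be running normally, so check its age and logs before deleting it.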
A Failed job blocks retry. So does a Running job that has been waiting for hours on a dependency that has since recovered: the init pod will not reconnect, and the operator stays Delayed while it reconciles around the job. In either case, delete the Arize AX-owned job and re-apply:
kubectl delete job <failed-or-stuck-job-name> -n arize || true
./arize.sh -y apply
One example is install-minio-init, which loops on Waiting to create local alias... if MinIO was unhealthy at job start time.
Log safety
arize.sh prints secret values such as hubJwt in the startup configuration block at the top of every install run. Redact install logs before sharing them with anyone outside your team.
Shared configuration
- External PostgreSQL requirements
- Configuring ingress and endpoints
- Ingress — NGINX, Istio, Kong, and others
- Configuring SAML