Data Fabric
Seamlessly sync traces to external data sources
Data Fabric is currently on waitlist - reach out to Arize Support to get started.
What is Data Fabric?
Data Fabric automatically syncs production trace data, evaluations, and annotations to your cloud data warehouse every 60 minutes in Iceberg format, giving you a single, always-current source of truth. With direct access to the raw trace data, teams can build analytics and custom workflows on top of it.

This is a powerful feature of adb, the Arize GenAI-native datastore.
Why is this better than a standard data export?
No lock-in: Your data is always available to you. You can move it to the data warehouse of your choice and use it in the tools you already rely on.
Single source of truth, always: No need to maintain separate copies or manage export jobs
Automatic updates: Your data stays current. Any updates via evaluations or annotations, even on months-old data, are captured in the next sync regardless of timestamp. Data syncs every 60 minutes.
Query-ready format: Data is stored in Iceberg format for direct querying in BigQuery, Snowflake, and other data warehouses
Time-partitioned: Uses Hive-standard time partitioning for efficient time-based queries (see the sketch below)
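For intuition, a synced project's objects might be laid out roughly as follows. This is a hypothetical listing that reuses the placeholder path from the BigQuery example later on; the partition naming is illustrative and your exact layout may differ:
# Hypothetical listing of a synced project's Iceberg data directory;
# the dt=... partition naming is illustrative and may differ.
gsutil ls gs://your-bucket/path/namespace/project-name/data/
# gs://your-bucket/path/namespace/project-name/data/dt=2024-05-01/
# gs://your-bucket/path/namespace/project-name/data/dt=2024-05-02/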
How does Data Fabric work?
Data Fabric is only enabled for Enterprise accounts. Reach out to Arize Support (support@arize.com) if you'd like trial access.
To set up Data Fabric, you must have write permissions on your target cloud storage bucket and at least one tracing project in your space.
A connector is a connection to a file path within your bucket. When you create a connector, you specify a bucket and namespace, as well as the projects to sync. You can add any number of projects to a connector or create one connector per project. Each connector must have a unique file path (see the example below).
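For illustration, two connectors writing to the same bucket could be laid out as follows. The connector names, paths, and project names are hypothetical; the point is simply that no two connectors share a file path:
# Hypothetical connector-to-path mapping; all names are illustrative.
# connector "production-traces" -> my-data-bucket/arize-sync/production  (syncs: chatbot, rag-app)
# connector "staging-traces"    -> my-data-bucket/arize-sync/staging     (syncs: experiments)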
Once you've created your connectors and added your projects, your data will sync automatically every 60 minutes. This includes updates to historical data that may have changed.
Project syncs can be paused, resumed or deleted as needed.
Supported Blobstore Providers: Google Cloud Storage (GCS)
Coming Soon: Amazon S3 and Azure Storage
Supported Data Warehouse Providers: BigQuery
Coming Soon: Snowflake and Databricks
Setting up Data Fabric
Step 1: Create a Data Connector
Navigate to Settings > Data Fabric in your space
Click New Connector
Fill out the basic connector information:
Connector Name: A descriptive name for your connector
Select Projects: Choose which tracing projects to sync. You can modify these projects later
Step 2: Configure Cloud Storage
Select Data Storage: Currently, only Google Cloud Storage is supported.
File Path: Enter your GCS path in the format:
my-data-bucket/arize-sync/production
Step 3: Set Up Permissions
1. Label Your Bucket: On the GCS bucket you are syncing to, set a bucket label with a key of arize-ingestion-key and the corresponding value copied from the setup dialog. This proves ownership of the bucket and authorizes Arize to access the data that will be synced into it.
Key: arize-ingestion-key
Value: See setup dialog
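If you prefer the command line, the label can also be set with gsutil. This is a sketch; the bucket name is a placeholder, and the label value must come from your setup dialog:
# Set the ownership label via the CLI; replace the bucket name and
# take the label value from the setup dialog.
gsutil label ch -l arize-ingestion-key:<VALUE_FROM_SETUP_DIALOG> gs://<YOUR_BUCKET>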
2. Create IAM Role: Run the provided command to create a custom IAM role.
gcloud iam roles create arizeDataFabric \
  --project=YOUR_PROJECT_ID \
  --title="Arize Data Fabric Role" \
  --description="Custom IAM role for Arize Data Fabric" \
  --permissions=storage.buckets.get,storage.objects.get,storage.objects.list,storage.objects.create,storage.objects.update,storage.objects.delete \
  --stage=ALPHA
3. Apply IAM Permissions: Grant the custom role to the Arize service account on your bucket.
gsutil iam ch serviceAccount:arize-data-fabric@production-269901.iam.gserviceaccount.com:projects/<YOUR_PROJECT_ID>/roles/arizeDataFabric <YOUR_FILEPATH>
Step 4: Validate and Start Sync
Validate: Click Validate to verify your configuration
Start Syncing: Once validated, click Start Job to begin syncing. Your first sync will begin immediately and then continue every 60 minutes
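If validation fails, you can sanity-check each piece of the setup from the CLI. A hypothetical troubleshooting pass, with YOUR_PROJECT_ID and <YOUR_BUCKET> as placeholders:
# Confirm the custom role exists with the expected permissions
gcloud iam roles describe arizeDataFabric --project=YOUR_PROJECT_ID
# Confirm the arize-ingestion-key label is set on the bucket
gsutil label get gs://<YOUR_BUCKET>
# Confirm the Arize service account holds the custom role on the bucket
gsutil iam get gs://<YOUR_BUCKET>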
Step 5: Set Up BigQuery Tables
Support for Snowflake and Databricks is coming soon.
Allow the initial sync to complete: Sync time depends on data size and shape, and may vary.
Create Table: Once your data is syncing to GCS, you can create BigQuery external tables to query the data directly. For each project being synced, create an external table using the Iceberg format:
CREATE EXTERNAL TABLE `your-project.your-dataset.your-table`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://your-bucket/path/namespace/project-name/metadata/latest.metadata.json']
);
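Once the external table exists, it can be queried like any other BigQuery table. A minimal sketch using the bq CLI, assuming the table name from the statement above; column names depend on your trace schema, so this simply previews rows:
# Preview a few rows of synced trace data; the table name matches the
# CREATE EXTERNAL TABLE statement above.
bq query --use_legacy_sql=false \
  'SELECT * FROM `your-project.your-dataset.your-table` LIMIT 10'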
Frequently Asked Questions
How often does data sync? Data syncs every 60 minutes automatically.
Can I sync multiple spaces to the same bucket? Yes, you can configure multiple connectors to write to the same bucket using different namespaces.
What happens if I delete a project that's being synced? The sync will stop for that project, but existing data in your storage will remain.
Can I change the sync frequency? Currently, the sync frequency is fixed at 60 minutes and cannot be customized. Customization is coming soon.
Is historical data included in the sync? Yes, all historical data is included in the initial sync, and any changes to historical data between syncs will be included in the next sync.
What's the difference between Data Fabric and manual exports? Data Fabric provides automatic, continuous syncing with evaluations and annotations, while exports are manual snapshots at a point in time.