Automatically generate embeddings for text, images, and structured data using pre-trained models.

Key Capabilities

  • Pre-trained models for common use cases
  • Batch processing for efficiency
  • Automatic handling of tokenization and preprocessing
  • Support for custom models

What is an Embedding?

Embeddings are dense vector representations of data learned by a model. They appear throughout modern deep learning: in transformers, recommendation engines, and the encoder and decoder layers of deep neural networks.
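As a toy illustration (plain NumPy, not the Arize SDK — the vectors below are made up for the example), data that is semantically similar maps to nearby vectors, which is commonly measured with cosine similarity:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for three inputs
cat = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.85, 0.15, 0.05, 0.25])
invoice = np.array([0.0, 0.8, 0.7, 0.1])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related concepts land close together in the vector space
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Real embeddings from the models above have hundreds of dimensions, but the same geometric intuition applies.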

Why Embeddings for Analyzing Deep Learning Models?

Data drift in unstructured data such as images and text is hard to measure: the statistical measures typically used for drift in structured data do not extend to it. The core challenge is that you must capture changes in the relationships within the unstructured data itself, and embeddings provide a representation in which those changes become measurable. Embeddings are also required to use Arize’s UMAP product line.
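To see why embeddings help, note that once unstructured inputs are mapped to vectors, drift can be quantified with ordinary vector statistics. A minimal sketch (plain NumPy with synthetic data, not Arize's actual drift computation) compares the centroid of a baseline batch of embeddings against a production batch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: batches of 100 examples, 8 dimensions each
baseline = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
baseline_2 = rng.normal(loc=0.0, scale=1.0, size=(100, 8))  # same distribution
production = rng.normal(loc=0.5, scale=1.0, size=(100, 8))  # shifted distribution

# Euclidean distance between batch centroids is a simple drift signal
drift = np.linalg.norm(baseline.mean(axis=0) - production.mean(axis=0))
same = np.linalg.norm(baseline.mean(axis=0) - baseline_2.mean(axis=0))

print(drift > same)  # the shifted batch sits measurably farther away
```

Production systems use richer distances than a centroid gap, but the principle is the same: embedding space turns "did my images change?" into a measurable geometric question.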

Quick Start

import pandas as pd
from arize.pandas.embeddings import EmbeddingGenerator, UseCases

# List available models
print(EmbeddingGenerator.list_pretrained_models())

# Create example data
df = pd.DataFrame({
    "text": [
        "The product quality is excellent.",
        "Shipping was delayed by 3 days.",
        "Customer service was very helpful.",
    ],
})

# Generate embeddings for NLP
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512,
    batch_size=100,
)

df["text_vector"] = generator.generate_embeddings(text_col=df["text"])
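The returned column holds one vector per row. As a stand-in sketch (NumPy arrays in place of real model output, since running the generator requires downloading the model), the resulting DataFrame looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text": [
        "The product quality is excellent.",
        "Shipping was delayed by 3 days.",
        "Customer service was very helpful.",
    ],
})

# Stand-in for generator.generate_embeddings: one 768-dimensional vector
# per row, matching distilbert-base-uncased's hidden size
df["text_vector"] = [np.zeros(768) for _ in range(len(df))]

print(df["text_vector"].iloc[0].shape)  # (768,)
```

Each row's vector can then be logged alongside the raw text for drift monitoring and UMAP visualization.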

Supported Use Cases

| Use Case | Model Types |
| --- | --- |
| UseCases.NLP.SEQUENCE_CLASSIFICATION | BERT, DistilBERT, RoBERTa |
| UseCases.NLP.SUMMARIZATION | BART, T5, Pegasus |
| UseCases.CV.IMAGE_CLASSIFICATION | ResNet, VGG, EfficientNet |
| UseCases.CV.OBJECT_DETECTION | YOLO, Faster R-CNN |
| UseCases.STRUCTURED.TABULAR_EMBEDDINGS | Custom tabular encoders |