# How To Generate Your Own Embedding

Embedding vectors are generally extracted from the **activation values** of one or many hidden layers of your model.
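
To make this concrete, the sketch below runs a toy two-layer network in NumPy and keeps the hidden-layer activations as the embedding. The network, its random weights, and the names `forward`, `W1`, and `W2` are illustrative assumptions and do not come from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 8 input features -> 4 hidden units -> 3 classes.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(4, 3))

def forward(x):
    """Return (logits, hidden activations) for a batch of inputs."""
    hidden = np.maximum(x @ W1, 0.0)  # ReLU activations of the hidden layer
    logits = hidden @ W2              # prediction head
    return logits, hidden

x = rng.normal(size=(2, 8))           # batch of 2 examples
logits, embeddings = forward(x)

# The hidden activations, not the logits, serve as the embedding vectors.
print(embeddings.shape)  # (2, 4)
```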

#### Ways to obtain embedding vectors

In general, there are many ways of obtaining embedding vectors, including:

1. Word embeddings

2. Autoencoder Embeddings

3. Generative Adversarial Networks (GANs)

4. Pre-trained Embeddings

Because pre-trained transformer models are so accessible, we will focus on them. This involves taking a model, such as BERT or GPT-x, that has been trained on a large dataset and made publicly available, and then fine-tuning it on a specific task.

### Use Case Examples

Once you have chosen a model to generate embeddings, the question is: *how?* The way you generate your embeddings must ensure that the resulting vector represents your input according to your use case.

<Tabs>
  <Tab title="CV Image Classification">
    If you are working on image classification, the model will take an image and classify it into a given set of categories. Each of our embedding vectors should be representative of the corresponding entire image input.

    First, we need to use a `feature_extractor` that will take an image and prepare it for the large pre-trained image model.

    ```python theme={null}
    inputs = feature_extractor(
        [x.convert("RGB") for x in batch["image"]],
        return_tensors="pt"
    ).to(device)
    ```

    Then, we pass the results from the `feature_extractor` to our `model`. In PyTorch, we wrap the call in `torch.no_grad()` because we are not training the model in this example, so there is no need to compute gradients for backpropagation.

    ```python theme={null}
    with torch.no_grad():
        outputs = model(**inputs)
    ```

    It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.

    ```python theme={null}
    last_hidden_state = outputs.last_hidden_state
    # last_hidden_state.shape = (batch_size, num_image_tokens, hidden_size)
    ```

    Finally, since we want the embedding vector to represent the entire image, we average across the second dimension (index 1), which represents the areas of the image.

    ```python theme={null}
    embeddings = torch.mean(last_hidden_state, 1).cpu().numpy()
    ```
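
    To see what the averaging does to the shapes, here is a minimal sketch that uses a small NumPy array as a stand-in for `last_hidden_state`. The array values and sizes are made up for illustration; with a real model the tensor would come from `outputs.last_hidden_state`.

    ```python
    import numpy as np

    # Stand-in for last_hidden_state: 2 images, 3 image tokens, hidden size 4.
    last_hidden_state = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

    # Average over axis 1 (the image-token axis) to get one vector per image.
    embeddings = last_hidden_state.mean(axis=1)

    print(embeddings.shape)  # (2, 4)
    print(embeddings[0])     # [4. 5. 6. 7.]
    ```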
  </Tab>

  <Tab title="NLP Classification">
    If you are working on NLP sequence classification (for example, sentiment classification), the model will take a piece of text and classify it into a given set of categories. Hence, your embedding vector must represent the entire piece of text.

    For this example, let us assume we are working with a model from the `BERT` family.

    First, we must use a `tokenizer` that will take the text and prepare it for the pre-trained large language model (LLM).

    ```python theme={null}
    inputs = {
        k: v.to(device)
        for k, v in batch.items()
        if k in tokenizer.model_input_names
    }
    ```
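
    The dictionary comprehension above keeps only the entries the model accepts and drops everything else, such as labels. Here is a minimal sketch of that filtering, with plain Python lists standing in for tensors. The hard-coded `model_input_names` is an assumption that mirrors what a BERT-style tokenizer typically reports, and the `batch` contents are made up for illustration.

    ```python
    # Assumed value of tokenizer.model_input_names for a BERT-style tokenizer.
    model_input_names = ["input_ids", "token_type_ids", "attention_mask"]

    # Hypothetical tokenized batch that still carries a non-model column.
    batch = {
        "input_ids": [[101, 2307, 3185, 102]],  # illustrative token ids
        "attention_mask": [[1, 1, 1, 1]],
        "label": [1],
    }

    # Keep only the entries the model's forward pass accepts.
    inputs = {k: v for k, v in batch.items() if k in model_input_names}

    print(sorted(inputs))  # ['attention_mask', 'input_ids']
    ```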

    Then, we pass the results from the `tokenizer` to our `model`. In PyTorch, we wrap the call in `torch.no_grad()` because we are not training the model in this example, so there is no need to compute gradients for backpropagation.

    ```python theme={null}
    with torch.no_grad():
        outputs = model(**inputs)
    ```

    It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.

    ```python theme={null}
    last_hidden_state = outputs.last_hidden_state
    # last_hidden_state.shape = (batch_size, num_tokens, hidden_size)
    ```

    Finally, since we want the embedding vector to represent the entire piece of text for classification, we will use the vector associated with the classification token, `[CLS]`, as our embedding vector.

    ```python theme={null}
    embeddings = last_hidden_state[:,0,:].cpu().numpy()
    ```
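
    As a quick shape check, the sketch below applies the same slicing to a small NumPy array standing in for `last_hidden_state`. The values are made up, and the sketch assumes that the `[CLS]` token sits at position 0, as it does for BERT-style tokenizers.

    ```python
    import numpy as np

    # Stand-in for last_hidden_state: 2 texts, 5 tokens each, hidden size 3.
    last_hidden_state = np.arange(30, dtype=np.float32).reshape(2, 5, 3)

    # Position 0 along the token axis holds the [CLS] vector (assumed layout).
    embeddings = last_hidden_state[:, 0, :]

    print(embeddings.shape)  # (2, 3)
    ```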
  </Tab>

  <Tab title="NLP Named Entity Recognition">
    If you are working on NLP Named Entity Recognition (NER), the model will take a piece of text and classify some words within it into a given set of entities. Hence, each of your embedding vectors must represent a classified word or token.

    For this example, let us assume we are working with a model from the `BERT` family.

    First, we must use a `tokenizer` that will take the text and prepare it for the pre-trained large language model (LLM).

    ```python theme={null}
    inputs = {
        k: v.to(device)
        for k, v in batch.items()
        if k in tokenizer.model_input_names
    }
    ```

    Then, we pass the results from the `tokenizer` to our `model`. In PyTorch, we wrap the call in `torch.no_grad()` because we are not training the model in this example, so there is no need to compute gradients for backpropagation.

    ```python theme={null}
    with torch.no_grad():
        outputs = model(**inputs)
    ```

    It is imperative that these outputs contain the activation values of the hidden layers of the model since you will be using them to construct your embeddings. In this scenario, we will use just the last hidden layer.

    ```python theme={null}
    last_hidden_state = outputs.last_hidden_state.cpu().numpy()
    # last_hidden_state.shape = (batch_size, num_tokens, hidden_size)
    ```

    Further, since we want the embedding vector to represent any given token, we will use the vector associated with a specific token in the piece of text as our embedding vector. Let `token_index` be the integer that locates the token of interest in the list of tokens produced by passing the piece of text to the `tokenizer`, and let `ex_index` be the integer that locates a given example in the batch. Then,

    ```python theme={null}
    token_embedding = last_hidden_state[ex_index, token_index,:]
    ```
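
    One way to obtain `token_index` is to look the token up in the tokenized text. The sketch below does this with a hand-written token list and a small NumPy array standing in for `last_hidden_state`. Both are made up for illustration, and a real tokenizer may split a word into several sub-word tokens.

    ```python
    import numpy as np

    # Hypothetical tokenizer output for one sentence, including special tokens.
    tokens = [["[CLS]", "paris", "is", "lovely", "[SEP]"]]

    # Stand-in for last_hidden_state: 1 text, 5 tokens, hidden size 4.
    last_hidden_state = np.arange(20, dtype=np.float32).reshape(1, 5, 4)

    ex_index = 0                                   # which example in the batch
    token_index = tokens[ex_index].index("paris")  # position of the token of interest

    token_embedding = last_hidden_state[ex_index, token_index, :]

    print(token_index)      # 1
    print(token_embedding)  # [4. 5. 6. 7.]
    ```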
  </Tab>
</Tabs>

### Additional Resources

Check out our tutorials on how to generate embeddings for different use cases using large, pre-trained models.

| Use-Case                                                    | Code                                                                                                                                                                                    |
| ----------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| NLP Multi-Class Sentiment Classification using Hugging Face | [Colab Link](https://colab.research.google.com/github/Arize-ai/tutorials_python/blob/main/Arize_Tutorials/Embeddings/NLP/Arize_Tutorial_NLP_Sentiment_Classification_HuggingFace.ipynb) |
| NLP Multi-Class Sentiment Classification using OpenAI       | [Colab Link](https://colab.research.google.com/github/Arize-ai/tutorials_python/blob/main/Arize_Tutorials/Embeddings/NLP/Arize_Tutorial_NLP_Sentiment_Classification_OpenAI.ipynb)      |
| NLP Named Entity Recognition using Hugging Face             | [Colab Link](https://colab.research.google.com/github/Arize-ai/tutorials_python/blob/main/Arize_Tutorials/Embeddings/NLP/Arize_Tutorial_NLP_Named_Entity_Recognition_HuggingFace.ipynb) |
| CV Image Classification using Hugging Face                  | [Colab Link](https://colab.research.google.com/github/Arize-ai/tutorials_python/blob/main/Arize_Tutorials/Embeddings/CV/Arize_Tutorial_CV_Image_Classification_HuggingFace.ipynb)       |
