How to export your data for labeling, evaluation, or fine-tuning
Embeddings can be extremely useful for fine-tuning. There are two ways to export your embeddings from the Phoenix UI.
To export a cluster (either selected via the lasso tool or via the cluster list in the right-hand panel), click the export button at the top left of the bottom slide-out.
To export all clusters of embeddings as a single dataframe (labeled by cluster), click the ...
icon at the top right of the screen and click export. Your data will be available either as a Parquet file or back in your notebook as a dataframe via your session:
session = px.active_session()
# The most recent export is the last entry in the session's exports list
session.exports[-1].dataframe
Quickly explore Phoenix with concrete examples
Phoenix ships with a collection of examples so you can quickly try out the app on concrete use-cases. This guide shows you how to download, inspect, and launch the app with example inferences.
To see a list of inferences available for download, run
px.load_example?
This displays the docstring for the phoenix.load_example
function, which contains a list of inferences available for download.
Choose the name of an inference set to download and pass it as an argument to phoenix.load_example
. For example, run the following to download production and training data for our demo sentiment classification model:
inferences = px.load_example("sentiment_classification_language_drift")
inferences
px.load_example
returns your downloaded data in the form of an ExampleInferences
instance. After running the code above, you should see the following in your cell output.
ExampleInferences(primary=<Inferences "sentiment_classification_language_drift_primary">, reference=<Inferences "sentiment_classification_language_drift_reference">)
Next, inspect the name, dataframe, and schema that define your primary inferences. First, run
prim_ds = inferences.primary
prim_ds.name
to see the name of the inferences in your cell output:
'sentiment_classification_language_drift_primary'
Next, run
prim_ds.schema
to see your inferences' schema in the cell output:
Schema(prediction_id_column_name='prediction_id', timestamp_column_name='prediction_ts', feature_column_names=['reviewer_age', 'reviewer_gender', 'product_category', 'language'], tag_column_names=None, prediction_label_column_name='pred_label', prediction_score_column_name=None, actual_label_column_name='label', actual_score_column_name=None, embedding_feature_column_names={'text_embedding': EmbeddingColumnNames(vector_column_name='text_vector', raw_data_column_name='text', link_to_data_column_name=None)}, excluded_column_names=None)
Last, run
prim_ds.dataframe.info()
to get an overview of your inferences' underlying dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 33411 entries, 2022-05-01 07:00:16+00:00 to 2022-06-01 07:00:16+00:00
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 prediction_ts 33411 non-null datetime64[ns, UTC]
1 reviewer_age 33411 non-null int16
2 reviewer_gender 33411 non-null object
3 product_category 33411 non-null object
4 language 33411 non-null object
5 text 33411 non-null object
6 text_vector 33411 non-null object
7 label 33411 non-null object
8 pred_label 33411 non-null object
9 prediction_id 0 non-null object
dtypes: datetime64[ns, UTC](1), int16(1), object(8)
memory usage: 2.6+ MB
Launch Phoenix with
px.launch_app(inferences.primary, inferences.reference)
Follow the instructions in the cell output to open the Phoenix UI in your notebook or in a separate browser tab.
Phoenix supports LLM application Traces and has examples that you can take a look at as well.
px.load_example_traces?
# Load up the LlamaIndex RAG example
px.launch_app(trace=px.load_example_traces("llama_index_rag"))
How to import prompts and responses from a Large Language Model (LLM)
For the Retrieval-Augmented Generation (RAG) use case, see the Retrieval section.
Below is the relevant subsection of the dataframe, including the embedding of the prompt.

| prompt | embedding | response |
| --- | --- | --- |
| who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | Neil Alden Armstrong |
| who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | Francis Michael Forde |
See Retrieval for the Retrieval-Augmented Generation (RAG) use case where relevant documents are retrieved for the question before constructing the context for the LLM.
primary_schema = Schema(
prediction_id_column_name="id",
prompt_column_names=EmbeddingColumnNames(
vector_column_name="embedding",
raw_data_column_name="prompt",
)
response_column_names="response",
)
Define the inferences by pairing the dataframe with the schema.
primary_inferences = px.Inferences(primary_dataframe, primary_schema)
session = px.launch_app(primary_inferences)
How to import data for the Retrieval-Augmented Generation (RAG) use case
In Retrieval-Augmented Generation (RAG), the retrieval step returns a list of documents relevant to the user query from a (proprietary) knowledge base (a.k.a. the corpus), then the generation step adds the retrieved documents to the prompt context to improve the response accuracy of the Large Language Model (LLM). The IDs of the retrieved documents, along with their relevance scores if present, can be imported into Phoenix as follows.
Below is the relevant subsection of the dataframe. The retrieved_document_ids
should match the ids in the corpus data. Note that for each row, the list under the relevance_scores
column has the same length as the list under the retrievals
column, but it's not necessary for all retrieval lists to have the same length.
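This per-row pairing can be sanity-checked with pandas before importing; the sketch below uses illustrative rows shaped like the example above:

```python
import pandas as pd

# Per row, ids and scores must align in length, but different rows
# may retrieve different numbers of documents.
df = pd.DataFrame({
    "retrieved_document_ids": [[7395, 567965, 323794], [38906, 38909]],
    "relevance_scores": [[11.30, 7.67, 5.85], [11.28, 9.10]],
})
ok = (df["retrieved_document_ids"].map(len) == df["relevance_scores"].map(len)).all()
print(ok)  # True
```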
Both the retrievals and scores are grouped under prompt_column_names along with the embedding of the query.
Define the inferences by pairing the dataframe with the schema.
How to create Phoenix inferences and schemas for the corpus data
In Information Retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding, then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. Corpus inferences can be imported into Phoenix as shown below.
Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
Below is an appropriate schema for the dataframe above. It specifies the id
column and that embedding
belongs to text
. Other columns, if present, will be detected automatically and need not be specified by the schema.
Define the inferences by pairing the dataframe with the schema.
The launcher accepts the corpus dataset through the corpus= parameter.
| query | embedding | retrieved_document_ids | relevance_scores |
| --- | --- | --- | --- |
| who was the first person that walked on the moon | [-0.0126, 0.0039, 0.0217, ... | [7395, 567965, 323794, ... | [11.30, 7.67, 5.85, ... |
| who was the 15th prime minister of australia | [0.0351, 0.0632, -0.0609, ... | [38906, 38909, 38912, ... | [11.28, 9.10, 8.39, ... |
| why is amino group in aniline an ortho para di... | [-0.0431, -0.0407, -0.0597, ... | [779579, 563725, 309367, ... | [-10.89, -10.90, -10.94, ... |
primary_schema = Schema(
prediction_id_column_name="id",
prompt_column_names=RetrievalEmbeddingColumnNames(
vector_column_name="embedding",
raw_data_column_name="query",
context_retrieval_ids_column_name="retrieved_document_ids",
context_retrieval_scores_column_name="relevance_scores",
)
)
primary_inferences = px.Inferences(primary_dataframe, primary_schema)
session = px.launch_app(primary_inferences)
| id | text | embedding |
| --- | --- | --- |
| 1 | Voyager 2 is a spacecraft used by NASA to expl... | [-0.02785328, -0.04709944, 0.042922903, 0.0559... |
| 2 | The Saturn Nebula is a planetary nebula in th... | [0.03544901, 0.039175965, 0.014074919, -0.0307... |
| 3 | Eris is a dwarf planet and a trans-Neptunian o... | [0.05506449, 0.0031612846, -0.020452883, -0.02... |
corpus_schema = px.Schema(
id_column_name="id",
document_column_names=EmbeddingColumnNames(
vector_column_name="embedding",
raw_data_column_name="text",
),
)
corpus_inferences = px.Inferences(corpus_dataframe, corpus_schema)
session = px.launch_app(primary_inferences, corpus=corpus_inferences)
How to create Phoenix inferences and schemas for common data formats
This guide shows you how to define Phoenix inferences using your own data.
Once you have a pandas dataframe df
containing your data and a schema
object describing the format of your dataframe, you can define your Phoenix dataset either by running
ds = px.Inferences(df, schema)
or by optionally providing a name for your dataset that will appear in the UI:
ds = px.Inferences(df, schema, name="training")
As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.
Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime
objects that represent the time at which each inference was made in production.
| timestamp | prediction_score | prediction | target |
| --- | --- | --- | --- |
| 2023-03-01 02:02:19 | 0.91 | click | click |
| 2023-02-17 23:45:48 | 0.37 | no_click | no_click |
| 2023-01-30 15:30:03 | 0.54 | click | no_click |
| 2023-02-03 19:56:09 | 0.74 | click | click |
| 2023-02-24 04:23:43 | 0.37 | no_click | click |
schema = px.Schema(
timestamp_column_name="timestamp",
prediction_score_column_name="prediction_score",
prediction_label_column_name="prediction",
actual_label_column_name="target",
)
This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.
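For concreteness, the dataframe above might be assembled like this before pairing it with the schema (a pandas-only sketch using the first three rows; the original timestamps are datetime.datetime objects, which pd.to_datetime reproduces):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-03-01 02:02:19",
        "2023-02-17 23:45:48",
        "2023-01-30 15:30:03",
    ]),
    "prediction_score": [0.91, 0.37, 0.54],
    "prediction": ["click", "no_click", "click"],
    "target": ["click", "no_click", "no_click"],
})
# Each schema field names one of these columns, e.g.
# timestamp_column_name="timestamp" points at the datetime column.
```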
Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.
| fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency | age | gender | predicted | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 578 | Scammeds | 4300 | 62966 | RENT | 110 | 0 | 0 | 25 | male | not_fraud | fraud |
| 507 | Schiller Ltd | 21000 | 52335 | RENT | 129 | 0 | 23 | 78 | female | not_fraud | not_fraud |
| 656 | Kirlin and Sons | 18000 | 94995 | MORTGAGE | 31 | 0 | 0 | 54 | female | uncertain | uncertain |
| 414 | Scammeds | 18000 | 32034 | LEASE | 81 | 2 | 0 | 34 | male | fraud | not_fraud |
| 512 | Champlin and Sons | 20000 | 46005 | OWN | 148 | 1 | 0 | 49 | male | uncertain | uncertain |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
feature_column_names=[
"fico_score",
"merchant_id",
"loan_amount",
"annual_income",
"home_ownership",
"num_credit_lines",
"inquests_in_last_6_months",
"months_since_last_delinquency",
],
tag_column_names=[
"age",
"gender",
],
)
If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names
field of your schema set to its default value of None
, in which case, any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.
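Conceptually, implicit feature inference amounts to a set difference over column names, as in this sketch (not Phoenix's actual implementation; the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(columns=["target", "predicted", "mean_radius", "mean_texture"])
# Columns the schema names explicitly (labels, timestamps, ids, ...)
schema_columns = {"target", "predicted"}
# Everything else is implicitly treated as a feature
implicit_features = [c for c in df.columns if c not in schema_columns]
print(implicit_features)  # ['mean_radius', 'mean_texture']
```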
| target | predicted | 30 feature columns (values abridged) |
| --- | --- | --- |
| malignant | benign | 15.49, 19.97, 102.40, 744.7, 0.11600, ..., 0.3187, 0.10190 |
| malignant | malignant | 17.01, 20.26, 109.70, 904.3, 0.08772, ..., 0.3275, 0.06469 |
| malignant | malignant | 17.99, 10.38, 122.80, 1001.0, 0.11840, ..., 0.4601, 0.11890 |
| benign | benign | 14.53, 13.98, 93.86, 644.2, 0.10990, ..., 0.2606, 0.07810 |
| benign | benign | 10.26, 14.71, 66.20, 321.6, 0.09882, ..., 0.2434, 0.08488 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
)
You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names
field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields; otherwise, Phoenix will assume they are features.
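The effect of excluded_column_names on implicit feature inference can be sketched as a set difference over column names (illustrative only, not Phoenix's actual implementation):

```python
import pandas as pd

df = pd.DataFrame(columns=[
    "target", "predicted", "hospital", "insurance_provider", "mean_radius"
])
schema_columns = {"target", "predicted"}          # named explicitly in the schema
excluded = {"hospital", "insurance_provider"}     # excluded_column_names
implicit_features = [
    c for c in df.columns if c not in schema_columns | excluded
]
print(implicit_features)  # ['mean_radius']
```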
| target | predicted | hospital | insurance_provider | 30 feature columns (values abridged) |
| --- | --- | --- | --- | --- |
| malignant | benign | Pacific Clinics | uninsured | 15.49, 19.97, 102.40, 744.7, ..., 0.3187, 0.10190 |
| malignant | malignant | Queens Hospital | Anthem Blue Cross | 17.01, 20.26, 109.70, 904.3, ..., 0.3275, 0.06469 |
| malignant | malignant | St. Francis Memorial Hospital | Blue Shield of CA | 17.99, 10.38, 122.80, 1001.0, ..., 0.4601, 0.11890 |
| benign | benign | Pacific Clinics | Kaiser Permanente | 14.53, 13.98, 93.86, 644.2, ..., 0.2606, 0.07810 |
| benign | benign | CityMed | Anthem Blue Cross | 10.26, 14.71, 66.20, 321.6, ..., 0.2434, 0.08488 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
excluded_column_names=[
"hospital",
"insurance_provider",
],
)
Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames
to associate multiple dataframe columns with the same embedding feature.
To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name
field on px.EmbeddingColumnNames
. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:
Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names
is a dictionary.
The key of this dictionary, "transaction_embeddings", is not a column of your dataframe but is the name you choose for your embedding feature; it appears in the UI.
The values of this dictionary are instances of px.EmbeddingColumnNames
.
Each entry in the "embedding_vector" column is a list of length 4.
| predicted | target | embedding_vector | fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud | not_fraud | [-0.97, 3.98, -0.03, 2.92] | 604 | Leannon Ward | 22000 | 100781 | RENT | 108 | 0 | 0 |
| fraud | not_fraud | [3.20, 3.95, 2.81, -0.09] | 612 | Scammeds | 7500 | 116184 | MORTGAGE | 42 | 2 | 56 |
| not_fraud | not_fraud | [-0.49, -0.62, 0.08, 2.03] | 646 | Leannon Ward | 32000 | 73666 | RENT | 131 | 0 | 0 |
| not_fraud | not_fraud | [1.69, 0.01, -0.76, 3.64] | 560 | Kirlin and Sons | 19000 | 38589 | MORTGAGE | 131 | 0 | 0 |
| uncertain | uncertain | [1.46, 0.69, 3.26, -0.17] | 636 | Champlin and Sons | 10000 | 100251 | MORTGAGE | 10 | 0 | 3 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
embedding_feature_column_names={
"transaction_embeddings": px.EmbeddingColumnNames(
vector_column_name="embedding_vector"
),
},
)
To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a given embedding feature are one-dimensional arrays of the same length; otherwise, Phoenix will throw an error.
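Before launching, a check like the following (a sketch assuming the vectors are stored as Python lists in an embedding column) can catch mismatched lengths early:

```python
import pandas as pd

# Toy column: the last vector has a different length than the others
df = pd.DataFrame({"embedding_vector": [[1.0, 2.0], [3.0, 4.0], [5.0]]})
lengths = df["embedding_vector"].map(len)
# All vectors for one embedding feature must share a single length
print(lengths.nunique() == 1)  # False: one vector has length 1, not 2
```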
If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name
field on px.EmbeddingColumnNames
. The following example contains data for an image classification model that detects product defects on an assembly line.
| defective | image | image_vector |
| --- | --- | --- |
| okay | https://www.example.com/image0.jpeg | [1.73, 2.67, 2.91, 1.79, 1.29] |
| defective | https://www.example.com/image1.jpeg | [2.18, -0.21, 0.87, 3.84, -0.97] |
| okay | https://www.example.com/image2.jpeg | [3.36, -0.62, 2.40, -0.94, 3.69] |
| defective | https://www.example.com/image3.jpeg | [2.77, 2.79, 3.36, 0.60, 3.10] |
| okay | https://www.example.com/image4.jpeg | [1.79, 2.06, 0.53, 3.58, 0.24] |
schema = px.Schema(
actual_label_column_name="defective",
embedding_feature_column_names={
"image_embedding": px.EmbeddingColumnNames(
vector_column_name="image_vector",
link_to_data_column_name="image",
),
},
)
For local image data, we recommend the following steps to serve your images via a local HTTP server:
In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.
For example, suppose your HTTP server is running in a directory with the following contents:
.
└── image-data
└── example_image.jpeg
Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.
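Deriving the URL column from local paths can then be a one-liner; this sketch assumes a hypothetical local_path column holding paths relative to the directory where the HTTP server is running:

```python
import pandas as pd

df = pd.DataFrame({"local_path": ["image-data/example_image.jpeg"]})
# Prefix each relative path with the local HTTP server's address
df["image"] = "http://localhost:8000/" + df["local_path"]
print(df["image"][0])  # http://localhost:8000/image-data/example_image.jpeg
```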
If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name
field on px.EmbeddingColumnNames
. The embeddings below were generated by a sentiment classification model trained on product reviews.
| name | text | text_vector | category | sentiment |
| --- | --- | --- | --- | --- |
| Magic Lamp | Makes a great desk lamp! | [2.66, 0.89, 1.17, 2.21] | office | positive |
| Ergo Desk Chair | This chair is pretty comfortable, but I wish it had better back support. | [3.33, 1.14, 2.57, 2.88] | office | neutral |
| Cloud Nine Mattress | I've been sleeping like a baby since I bought this thing. | [2.5, 3.74, 0.04, -0.94] | bedroom | positive |
| Dr. Fresh's Spearmint Toothpaste | Avoid at all costs, it tastes like soap. | [1.78, -0.24, 1.37, 2.6] | personal_hygiene | negative |
| Ultra-Fuzzy Bath Mat | Cheap quality, began fraying at the edges after the first wash. | [2.71, 0.98, -0.22, 2.1] | bath | negative |
schema = px.Schema(
actual_label_column_name="sentiment",
feature_column_names=[
"category",
],
tag_column_names=[
"name",
],
embedding_feature_column_names={
"product_review_embeddings": px.EmbeddingColumnNames(
vector_column_name="text_vector",
raw_data_column_name="text",
),
},
)
Sometimes it is useful to have more than one embedding feature. The example below shows a multi-modal application in which one embedding represents the textual description and another embedding represents the image associated with products on an e-commerce site.
| name | description | description_vector | image | image_vector |
| --- | --- | --- | --- | --- |
| Magic Lamp | Enjoy the most comfortable setting every time for working, studying, relaxing or getting ready to sleep. | [2.47, -0.01, -0.22, 0.93] | https://www.example.com/image0.jpeg | [2.42, 1.95, 0.81, 2.60, 0.27] |
| Ergo Desk Chair | The perfect mesh chair, meticulously developed to deliver maximum comfort and high quality. | [-0.25, 0.07, 2.90, 1.57] | https://www.example.com/image1.jpeg | [3.17, 2.75, 1.39, 0.44, 3.30] |
| Cloud Nine Mattress | Our Cloud Nine Mattress combines cool comfort with maximum affordability. | [1.36, -0.88, -0.45, 0.84] | https://www.example.com/image2.jpeg | [-0.22, 0.87, 1.10, -0.78, 1.25] |
| Dr. Fresh's Spearmint Toothpaste | Natural toothpaste helps remove surface stains for a brighter, whiter smile with anti-plaque formula | [-0.39, 1.29, 0.92, 2.51] | https://www.example.com/image3.jpeg | [1.95, 2.66, 3.97, 0.90, 2.86] |
| Ultra-Fuzzy Bath Mat | The bath mats are made up of 1.18-inch height premium thick, soft and fluffy microfiber, making it great for bathroom, vanity, and master bedroom. | [0.37, 3.22, 1.29, 0.65] | https://www.example.com/image4.jpeg | [0.77, 1.79, 0.52, 3.79, 0.47] |
schema = px.Schema(
tag_column_names=["name"],
embedding_feature_column_names={
"description_embedding": px.EmbeddingColumnNames(
vector_column_name="description_vector",
raw_data_column_name="description",
),
"image_embedding": px.EmbeddingColumnNames(
vector_column_name="image_vector",
link_to_data_column_name="image",
),
},
)
How to define your inference set(s), launch a session, open the UI in your notebook or browser, and close your session when you're done
To define inferences, you must load your data into a pandas dataframe and create a matching schema. If you have a dataframe prim_df
and a matching prim_schema
, you can define inferences named "primary" with
prim_ds = px.Inferences(prim_df, prim_schema, "primary")
If you additionally have a dataframe ref_df
and a matching ref_schema
, you can define an inference set named "reference" with
ref_ds = px.Inferences(ref_df, ref_schema, "reference")
See Corpus Data if you have corpus data for an Information Retrieval use case.
Use phoenix.launch_app
to start your Phoenix session in the background. You can launch Phoenix with zero, one, or two inference sets.
You can view and interact with the Phoenix UI either directly in your notebook or in a separate browser tab or window.
In a notebook cell, run
session.url
Copy and paste the output URL into a new browser tab or window.
In a notebook cell, run
session.view()
The Phoenix UI will appear in an inline frame in the cell output.
When you're done using Phoenix, gracefully shut down your running background session with
px.close_app()
| Scenario | Command | Use cases |
| --- | --- | --- |
| No inferences | session = px.launch_app() | Run Phoenix in the background to collect OpenInference traces emitted by your instrumented LLM application. |
| Single inference set | session = px.launch_app(ds) | Analyze a single cohort of data, e.g., only training data. Check model performance and data quality, but not drift. |
| Primary and reference inference sets | session = px.launch_app(prim_ds, ref_ds) | Compare cohorts of data, e.g., training vs. production. Analyze drift in addition to model performance and data quality. |
| Primary and corpus inference sets | session = px.launch_app(query_ds, corpus=corpus_ds) | Compare a query inference set to a corpus dataset to analyze your retrieval-augmented generation applications. |
Phoenix supports any type of dense embedding generated for almost any type of data.
But what if I don't have embeddings handy? That is not a problem: your model data can be analyzed using embeddings auto-generated by Phoenix.
Generating embeddings is likely another problem to solve, on top of ensuring your model is performing properly. With our Python SDK, you can offload that task to the SDK and we will generate the embeddings for you. We use large, pre-trained models that capture information from your inputs and encode it into embedding vectors.
We support generating embeddings for you for the following types of data:
CV - Computer Vision
NLP - Natural Language
Tabular Data - Pandas Dataframes
We extract the embeddings in the appropriate way depending on your use case and return them for you to include in your pandas dataframe, which you can then analyze using Phoenix.
Auto-Embeddings works end-to-end: you don't have to worry about formatting your inputs for the correct model. Simply pass your input, and an embedding will come out as a result; we take care of everything in between.
If you want to use this functionality as part of our Python SDK, you need to install it with the extra dependencies using pip install arize[AutoEmbeddings].
You can get an updated table listing of supported models by running the line below.
from arize.pandas.embeddings import EmbeddingGenerator
EmbeddingGenerator.list_pretrained_models()
Auto-Embeddings is designed to require minimal code from the user. We only require two steps:
Create the generator: instantiate the generator using EmbeddingGenerator.from_use_case()
, passing information about your use case, the model to use, and more options depending on the use case; see examples below.
Let Arize generate your embeddings: obtain your embeddings column by calling generator.generate_embeddings()
and passing the column containing your inputs; see examples below.
Arize expects the dataframe's index to be sorted and begin at 0. If you perform operations that might affect the index prior to generating embeddings, reset the index as follows:
df = df.reset_index(drop=True)
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
df = df.reset_index(drop=True)
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.CV.IMAGE_CLASSIFICATION,
model_name="google/vit-base-patch16-224-in21k",
batch_size=100
)
df["image_vector"] = generator.generate_embeddings(
local_image_path_col=df["local_path"]
)
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
df = df.reset_index(drop=True)
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,
model_name="distilbert-base-uncased",
tokenizer_max_length=512,
batch_size=100
)
df["text_vector"] = generator.generate_embeddings(text_col=df["text"])
from arize.pandas.embeddings import EmbeddingGeneratorForTabularFeatures

df = df.reset_index(drop=True)
# Instantiate the embedding generator
generator = EmbeddingGeneratorForTabularFeatures(
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512
)
# Select the columns from your dataframe to consider
selected_cols = [...]
# (Optional) Provide a mapping for more verbose column names
column_name_map = {...: ...}
# Generate tabular embeddings and assign them to a new column
df["tabular_embedding_vector"] = generator.generate_embeddings(
df,
selected_columns=selected_cols,
col_name_map=column_name_map # (OPTIONAL, can remove)
)