Monitoring Text-Based Generative AI Models Using Metrics Like BLEU Score

Hakan Tekgul, ML Solutions Engineer | Published March 03, 2023

Tips for measuring text-based generative models using BLEU, ROUGE, METEOR, and BERTScore as well as prediction embeddings

In recent years, text-based generative AI models have been making significant strides in natural language processing tasks such as language translation, text summarization, and dialogue generation. These models are capable of generating text that is often indistinguishable from human-generated text, making them increasingly popular in various industries, including customer service, content generation, and data analysis. While these models can be incredibly powerful and useful, they can also produce unexpected or even harmful output, making it critical to monitor them closely.


For example, consider a chatbot that is designed to help customers with their queries. If the model is not monitored, it could generate inappropriate or unhelpful responses, damaging the reputation of the company that deployed it. Therefore, it is essential to monitor these models’ performance regularly to ensure that they are producing accurate and unbiased results. In this article, we take a deep dive into how to monitor text-based generative models using performance metrics such as BLEU, ROUGE, METEOR, and BERTScore, as well as prediction embeddings.

Monitoring Generative Models with Reference Text

To evaluate the performance of machine-generated text, a reference text or ground truth is used for comparison. This reference text is what the generative model would ideally produce, and it is usually collected from human domain experts. When a reference text is available for the text the model generates, there are several metrics for computing performance. Let’s try to understand the different types of performance metrics for generative models with real-life examples in Python.

BLEU Score: Bilingual Evaluation Understudy

BLEU is a precision-focused metric that measures the n-gram overlap between the generated text and the reference text. The score also includes a brevity penalty, which is applied when the machine-generated text is too short compared to the reference text. It is most commonly used to measure machine translation performance. The score ranges from 0 to 1, with higher scores indicating greater similarity between the generated text and the reference text.

[Figure: BLEU score formula]

from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects a list of tokenized references and one tokenized candidate
reference = [['this', 'movie', 'was', 'awesome']]
candidate = ['this', 'movie', 'was', 'awesome', 'too']
score = sentence_bleu(reference, candidate)
print(score)
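
To make the n-gram precision and brevity penalty described above more concrete, here is a minimal pure-Python sketch of a single-reference BLEU score with uniform weights over 1- to 4-grams and no smoothing. The helper names (`ngrams`, `modified_precision`, `simple_bleu`) are made up for illustration and are not part of NLTK; library implementations handle multiple references, smoothing, and edge cases that this sketch ignores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def simple_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    c, r = len(candidate), len(reference)
    # The penalty only applies when the candidate is not longer than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geo_mean


reference = ['this', 'movie', 'was', 'awesome']
candidate = ['this', 'movie', 'was', 'awesome', 'too']
bleu = simple_bleu(reference, candidate)
print(round(bleu, 4))
```

For this pair, the clipped precisions are 4/5, 3/4, 2/3, and 1/2, and no brevity penalty applies because the candidate is longer than the reference, so the score is the geometric mean of those four values.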

ROUGE Score: Recall-Oriented Understudy for Gisting Evaluation

ROUGE is a metric that measures the overlap between the generated text and the reference text in terms of recall. ROUGE comes in three types: rouge-n, the most prevalent form, which detects n-gram overlap; rouge-l, which identifies the Longest Common Subsequence; and rouge-s, which concentrates on skip-grams. rouge-n is the most frequently used type, with the following formula:

[Figure: ROUGE score formula]

The following code demonstrates how to calculate the rouge-2 score in Python:

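from rouge import Rouge

reference = 'this movie was awesome'
candidate = 'this movie was awesome too'
rouge = Rouge()
# get_scores takes the hypothesis first, then the reference
scores = rouge.get_scores(candidate, reference)[0]
rouge_2 = scores['rouge-2']  # dict with recall 'r', precision 'p', and F1 'f'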
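
As a rough illustration of what rouge-n measures, the sketch below counts overlapping n-grams by hand and reports recall, precision, and F1. The function names `ngram_counts` and `rouge_n` are hypothetical, whitespace tokenization is an assumption, and the `rouge` package may differ in details such as tokenization and how duplicate n-grams are counted.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=2):
    """Overlap-based recall, precision, and F1 for n-grams."""
    ref_counts = ngram_counts(reference, n)
    cand_counts = ngram_counts(candidate, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {'r': recall, 'p': precision, 'f': f1}


reference = 'this movie was awesome'
candidate = 'this movie was awesome too'
result = rouge_n(reference, candidate, n=2)
print(result)
```

Here all three reference bigrams appear in the candidate, so recall is 3/3, while only three of the four candidate bigrams appear in the reference, so precision is 3/4.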

The main difference between ROUGE and BLEU is that BLEU is precision-focused, whereas ROUGE focuses on recall.

METEOR Score: Metric for Evaluation of Translation with Explicit Ordering

METEOR is a metric that measures the quality of generated text based on the alignment between the generated text and the reference text. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. You can check out the algorithm behind METEOR here.

The following code demonstrates how to calculate the METEOR score using the NLTK library in Python:

from nltk.translate.meteor_score import single_meteor_score

# METEOR uses WordNet for synonym matching; run nltk.download('wordnet') once if it is missing
reference = ['this', 'movie', 'was', 'awesome']
candidate = ['this', 'movie', 'was', 'awesome', 'too']
score = single_meteor_score(reference, candidate)
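
To show how a recall-weighted harmonic mean and an ordering penalty fit together, here is a simplified sketch that uses exact word matches only. It is not the full METEOR algorithm: real implementations also match stems and synonyms and choose the alignment more carefully. The function name `simple_meteor`, the greedy alignment, and the default values of `alpha`, `beta`, and `gamma` are assumptions made for illustration.

```python
def simple_meteor(reference, candidate, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR-style score (illustrative simplification)."""
    # Greedily align each candidate token to the first unused identical reference token.
    used = set()
    alignment = []  # (candidate_index, reference_index) pairs
    for ci, word in enumerate(candidate):
        for ri, ref_word in enumerate(reference):
            if ri not in used and ref_word == word:
                used.add(ri)
                alignment.append((ci, ri))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(candidate)
    recall = matches / len(reference)
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both texts.
    chunks = 1
    for (c_prev, r_prev), (c_cur, r_cur) in zip(alignment, alignment[1:]):
        if c_cur != c_prev + 1 or r_cur != r_prev + 1:
            chunks += 1

    # More chunks means the matched words are more fragmented or reordered.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


reference = ['this', 'movie', 'was', 'awesome']
candidate = ['this', 'movie', 'was', 'awesome', 'too']
meteor = simple_meteor(reference, candidate)
print(round(meteor, 4))
```

In this example the four matched words form a single chunk, so the penalty is small and the score stays close to the weighted harmonic mean of precision (4/5) and recall (4/4).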

While BLEU focuses on precision and ROUGE focuses on recall, METEOR was designed to fix some of the problems found in the more popular BLEU and ROUGE metrics and to correlate well with human judgment at the sentence or segment level.

BERTScore

One main disadvantage of metrics such as BLEU or ROUGE is that the measured performance of a text generation model depends on exact matches. Exact matches might be important for use cases like machine translation. However, for generative AI models that try to create meaningful text similar to their corpus data, exact matches can be a poor measure of quality, since a response can be correct without repeating the reference word for word.

[Figure: what is BERTScore]

Hence, instead of exact matches, BERTScore is focused on the similarity between reference and generated text by using contextual embeddings. The main idea behind contextual embeddings is to understand the meaning behind the reference and candidate text respectively and then compare those meanings.

The following code demonstrates how to calculate BERTScore using the bert_score library in Python:

import torch
from bert_score import score

# reference and generated texts
ref_text = "The quick brown fox jumps over the lazy dog."
gen_text = "A fast brown fox leaps over a lazy hound."

# compute Bert score
P, R, F1 = score([gen_text], [ref_text], lang="en", model_type="bert-base-uncased")

# print results
print(f"Bert score: P={P.item():.4f} R={R.item():.4f} F1={F1.item():.4f}")
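
To give a feel for how comparing meanings with embeddings can work, the sketch below implements greedy matching by cosine similarity, which is the core idea behind BERTScore. Each reference token is matched to its most similar candidate token for recall, and each candidate token to its most similar reference token for precision. The three-dimensional vectors are made-up stand-ins for contextual embeddings, and the function names are hypothetical; the bert_score library obtains its vectors from a BERT model and offers options not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_score(ref_vectors, cand_vectors):
    """Precision, recall, and F1 from greedy cosine matching of token vectors."""
    # Recall: how well each reference token is covered by some candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vectors) for r in ref_vectors) / len(ref_vectors)
    # Precision: how well each candidate token is supported by some reference token.
    precision = sum(max(cosine(c, r) for r in ref_vectors) for c in cand_vectors) / len(cand_vectors)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy "token embeddings" (one vector per token).
ref_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cand_vectors = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6]]

p, r, f1 = greedy_match_score(ref_vectors, cand_vectors)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Because the match is based on vector similarity rather than identical strings, two different words with nearby embeddings can still contribute a high score.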

Monitoring Generative Models without Reference Text

When generative models produce text without any reference, monitoring them can be challenging, since performance metrics such as ROUGE or METEOR cannot be computed. However, just as with non-generative models, proxy metrics such as drift can be used to monitor generative models. In this case, since the models’ outputs are text, text embeddings can be leveraged to track changes in predictions. These embeddings provide a representation of the model’s output and can be used to compare different outputs to identify changes in the model’s behavior over time.

Specifically, the Euclidean distance between prediction embeddings can be computed to track how the model changes over time. However, tracking drift alone might not be enough to improve model performance or to make sure model behavior stays consistent. As an additional step, the prediction embeddings can be visualized in a lower-dimensional space, where predictions that fall in the same cluster are likely to have similar semantic meaning. Outlier points in that space can be analyzed and may even be used for retraining. Finally, with embedding visualizations, machine learning engineers can also identify and block certain clusters of words so that the generative model does not produce biased output.

To demonstrate the use of prediction embeddings, let’s consider an example of a language model trained on a dataset of news articles. Suppose we have a model that produces an article about politics, and we want to compare its output to another article produced by the same model six months later.

First, we can use the transformers library to tokenize the two articles:

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

article1 = "The government announced new policies on climate change."
article2 = "The government released a statement on their new policies for combating climate change."

inputs1 = tokenizer.encode(article1, return_tensors="pt")
inputs2 = tokenizer.encode(article2, return_tensors="pt")

Next, we can generate the embeddings for the two articles using the model’s output:

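with torch.no_grad():
    # request hidden states explicitly: the causal LM output has no last_hidden_state attribute
    outputs1 = model(inputs1, output_hidden_states=True)
    # mean-pool the final layer over the token dimension to get one vector per article
    embeddings1 = outputs1.hidden_states[-1].mean(dim=1).squeeze().numpy()

    outputs2 = model(inputs2, output_hidden_states=True)
    embeddings2 = outputs2.hidden_states[-1].mean(dim=1).squeeze().numpy()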

Here, the mean of the model’s last hidden state serves as the embedding for each article. Finally, we can calculate the Euclidean distance between the two embeddings to compare the two articles:

from scipy.spatial.distance import euclidean

distance = euclidean(embeddings1, embeddings2)
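
As a rough sketch of how this pairwise comparison could be extended to monitoring over time, one option is to average the embeddings in a baseline period and in each later time window, then measure the Euclidean distance between those averages. The function names, the windowing scheme, and the toy two-dimensional vectors below are all assumptions for illustration; real embeddings would come from the model as shown above.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_by_window(baseline, windows):
    """Euclidean distance from the baseline centroid to each window's centroid."""
    base_center = centroid(baseline)
    return [math.dist(base_center, centroid(window)) for window in windows]


# Hypothetical toy embeddings: one list of vectors per time period.
baseline = [[0.0, 0.0], [2.0, 0.0]]
windows = [
    [[1.0, 0.0], [1.0, 0.0]],  # same centroid as the baseline
    [[4.0, 4.0], [4.0, 4.0]],  # centroid has moved away from the baseline
]

drift = drift_by_window(baseline, windows)
print(drift)
```

A distance that grows from one window to the next suggests the model’s outputs are moving away from what was seen in the baseline period, which is a signal to inspect those predictions more closely.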

Now, we can compute the Euclidean distance across all of our prediction embeddings over time and see whether our model’s behavior is changing. The Arize ML Observability platform can automatically generate embeddings from your model’s generated text and compute Euclidean distances over time.

[Figure: Arize example of monitoring generative AI]

By clicking on a specific point in time, you can also see Arize’s UMAP visualization of your embeddings in a lower-dimensional space. Additionally, Arize will automatically find clusters with similar semantic meaning where users can highlight the different clusters and extract insights from each cluster.

[Figure: UMAP example of monitoring generative AI]

Finally, if you would like to learn more about how Arize enables observability for generative AI models in real-life, you can check out an article from a generative AI company here.

Next Steps of Generative Model Monitoring

Even though traditional metrics such as BLEU, ROUGE, or METEOR can be promising for model performance monitoring, using large language models (LLMs) such as BERT for evaluation is expected to become more common in the next few years. Using LLMs to evaluate LLMs on complex tasks is an emerging area of research that aims to enhance the performance of language models. The use of LLMs for evaluation can be advantageous since they can capture complex patterns and dependencies within large datasets that traditional evaluation metrics may overlook. Additionally, LLMs can be trained on a wide range of tasks, which can aid in the evaluation of other LLMs across multiple domains. As an example, LangChain provides some chains/prompts to evaluate question answering tasks by using LLMs.


In conclusion, monitoring text-based generative AI models is a crucial process that helps ensure their performance and fairness over time. Using performance metrics such as BLEU, ROUGE, and METEOR scores, we can evaluate the quality of the model’s output and track changes in its behavior. Additionally, prediction embeddings are a valuable tool for detecting and monitoring drift when no reference text is available, which can help improve the model’s accuracy and fairness. However, there are limitations to generative AI model monitoring, and additional measures such as diverse training data and regular retraining may be necessary to ensure model performance. Overall, by incorporating monitoring techniques and best practices, we can support the continued success of text-based generative AI models in a variety of applications, from chatbots to content generation and beyond.