Evaluating Model Performance with the ROUGE Metric: A Comprehensive Guide
Evaluating text quality is tricky, yet it's foundational to any use of generative AI models. Although assessing the quality of a text is often easy for humans, it's usually much harder for machines. One reason is that the criteria that determine the quality of text are not well defined. Furthermore, those criteria might change depending on the text's purpose and context.
This blog post is the first in a series in which we will dive deep into methods for evaluating the quality of machine learning model outputs and present the different evaluation metrics that are currently available. See also our BLEU metric and GRUEN metric deep dives.
In this post, we’ll present one method for evaluating the outputs of summary generation models: the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.
This metric is especially useful when evaluating a machine-generated summary of a long text. Assume we are trying to evaluate the quality of an ML model that generates such summaries. In this case, a high-quality summary might be defined as one that preserves the most important information that appears in the reference text.
It’s important to note that while a high score for the text-summarization algorithm means it’s able to preserve the most important information, meeting this criterion doesn’t necessarily mean that the text is high quality. For example, the algorithm might preserve the most important information while also generating some toxic or biased text. In other words, it might violate some additional quality-related constraints we want the generated output to respect. Therefore, the problem of evaluating the quality of generated text is a multidimensional task, with the specific dimensions defined by the specific application at hand.
For this post, we'll keep things simple and focus only on how much of the important information is preserved. As we'll show next, this can be quantified with the ROUGE metric.
Understanding the ROUGE Metric
ROUGE is a set of metrics originally designed for evaluating text-summarization algorithms. The metrics compare an automatically produced summary against a human-produced reference summary.
For illustration, consider the following human-produced reference (denoted with R) and a model-generated text (denoted with C):
R: Dan loves chocolate cakes
C: Dan loves chocolate chip cookies
ROUGE-N
The ROUGE-N score measures the overlap of n-grams between the reference and the generated text: the number of sequences of n consecutive words (n is sometimes referred to as the gram length) that appear in both the generated text and the reference text. Formally, we define:
ROUGE[n]-recall = (# of n-grams that appear in both R and C) / (# of n-grams in R)
ROUGE[n]-precision = (# of n-grams that appear in both R and C) / (# of n-grams in C)
The ROUGE[n]-F1 score is then defined as:
ROUGE[n]-F1 = 2 * ROUGE[n]-recall * ROUGE[n]-precision / (ROUGE[n]-recall + ROUGE[n]-precision)
Back to our example:
ROUGE-1:
ROUGE[1]-recall = 3/4 = 0.75
ROUGE[1]-precision = 3/5 = 0.6
ROUGE[1]-F1 = 2 * (0.75 * 0.6) / (0.75 + 0.6) ≈ 0.67
ROUGE-2:
ROUGE[2]-recall = 2/3 ≈ 0.67
ROUGE[2]-precision = 2/4 = 0.5
ROUGE[2]-F1 = 2 * (0.67 * 0.5) / (0.67 + 0.5) ≈ 0.57
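If you'd like to verify these numbers yourself, the sketch below computes ROUGE-N recall, precision, and F1 in Python. It assumes simple whitespace tokenization, lowercasing, and no stemming (real implementations usually offer stemming and smarter tokenization), and it reproduces the figures above.

```python
from collections import Counter

def ngrams(text, n):
    """Return a multiset of word n-grams (whitespace tokenization, lowercased)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n):
    """ROUGE-N recall, precision, and F1 from clipped n-gram overlap counts."""
    ref_counts = ngrams(reference, n)
    cand_counts = ngrams(candidate, n)
    overlap = sum((ref_counts & cand_counts).values())   # n-grams appearing in both R and C
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

R = "Dan loves chocolate cakes"
C = "Dan loves chocolate chip cookies"

print(rouge_n(R, C, 1))  # ROUGE-1: approximately (0.75, 0.60, 0.67)
print(rouge_n(R, C, 2))  # ROUGE-2: approximately (0.67, 0.50, 0.57)
```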
ROUGE-L
While ROUGE-N is based on the overlap of n consecutive words in the reference and the automatically produced text, ROUGE-L considers the longest common subsequence (LCS) of words: the words need not be consecutive, but they must appear in the same order. For instance, in our example the LCS is “Dan loves chocolate”, which has a length of 3. Therefore:
ROUGE-L-recall = 3/4 = 0.75
ROUGE-L-precision = 3/5 = 0.6
ROUGE-L-F1 ≈ 0.67
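For ROUGE-L, the only new ingredient is the length of the longest common subsequence, which can be computed with a standard dynamic program. The sketch below uses the same symmetric F1 as above (the original ROUGE-L definition also allows a recall-weighted F-measure, but the harmonic mean is what common implementations report).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """ROUGE-L recall, precision, and F1 based on the LCS length."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall, precision = lcs / len(ref), lcs / len(cand)
    f1 = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f1

print(rouge_l("Dan loves chocolate cakes", "Dan loves chocolate chip cookies"))
# approximately (0.75, 0.60, 0.67)
```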
ROUGE-S (n)
ROUGE-S is a skip-gram co-occurrence metric: it considers n-grams that appear in the reference text and allows their words to be separated by one or more other words in the model output, as long as they still appear in the same order.
For example, consider the following sentences:
R: Dan loves chocolate cakes
C: Dan loves chocolate chip cookies and cakes
Consider the 2-gram “chocolate cakes”: ROUGE-2 counts it only if it appears in C verbatim. In this example it does not match any 2-gram in C, so it contributes nothing to the ROUGE-2 score. With ROUGE-S and a skip distance of 3 (up to three words may separate the two words of the pair), however, “chocolate cakes” in R is matched with “chocolate chip cookies and cakes” in C, because “chocolate” and “cakes” appear there in the same order, separated by three words.
ROUGE-S recall, precision, and F1-score can be computed in the same way as the standard ROUGE score.
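To make the skip-gram idea concrete, here is a sketch of a ROUGE-S-style computation over skip-bigrams, i.e., in-order word pairs. The max_skip parameter (the maximum number of words allowed between the two words of a pair) and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(text, max_skip):
    """In-order word pairs with at most max_skip words between them."""
    tokens = text.lower().split()
    return Counter(
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if j - i - 1 <= max_skip
    )

def rouge_s(reference, candidate, max_skip):
    """ROUGE-S recall, precision, and F1 from skip-bigram overlap."""
    ref_pairs = skip_bigrams(reference, max_skip)
    cand_pairs = skip_bigrams(candidate, max_skip)
    overlap = sum((ref_pairs & cand_pairs).values())
    recall = overlap / max(sum(ref_pairs.values()), 1)
    precision = overlap / max(sum(cand_pairs.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

R = "Dan loves chocolate cakes"
C = "Dan loves chocolate chip cookies and cakes"
print(rouge_s(R, C, max_skip=3))  # the pair ("chocolate", "cakes") now counts as a match
```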
Which ROUGE metric should I use?
That depends heavily on your specific use case. ROUGE-N has stricter matching requirements than the other variants, but that restrictiveness has a cost: it is less tolerant of small, acceptable changes such as the use of synonyms or a different word order. It is common to compute several variants side by side and compare them, as in the sketch below.
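You rarely need to implement ROUGE yourself. One widely used open-source implementation is the rouge-score Python package (the Hugging Face evaluate library wraps it as well); note that it covers ROUGE-1/2/L but not ROUGE-S. A minimal usage sketch, assuming the package is installed:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

# score(target, prediction): the first argument is the reference, the second the model output
scores = scorer.score("Dan loves chocolate cakes", "Dan loves chocolate chip cookies")

for name, s in scores.items():
    print(f"{name}: recall={s.recall:.2f} precision={s.precision:.2f} f1={s.fmeasure:.2f}")
# rouge1: recall=0.75 precision=0.60 f1=0.67
# rouge2: recall=0.67 precision=0.50 f1=0.57
# rougeL: recall=0.75 precision=0.60 f1=0.67
```

Reporting several variants together also makes it easier to see whether a low score comes from missing reference content (low recall) or from extra, unsupported content in the output (low precision).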
Pros and Cons of ROUGE
Pros – ROUGE is inexpensive and fast to compute. In addition, it can be applied to any input language.
Cons – ROUGE measures surface (lexical) overlap rather than semantic similarity. This has several implications: for example, it gives no credit to words that have the same meaning as those in the reference. It also requires a human-produced reference, so it cannot be used to evaluate outputs for which no reference summary is available.
Final Words
As we mentioned earlier, no single metric tells the whole story. In other words, different metrics measure different aspects of quality. Therefore, it’s essential to consider different metrics when evaluating text. Stay tuned — in our next post, we’ll consider some crucial complementary metrics!