The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].
When calculating any ROUGE metric, you get an aggregate result with three fields: low, mid, and high. How are these aggregate values calculated?
For example, from the huggingface implementation [2]:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0
Answer:
Given a list of (summary, gold_summary) pairs, each ROUGE metric is calculated for every item in the list. In the huggingface implementation, you can opt out of the aggregation step by passing use_aggregator=False, which returns these per-item values instead.
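For illustration (reusing the predictions/references from the question, where every pair matches exactly, so each per-example score is 1.0), this looks roughly like:

>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=False)
>>> print(results["rouge1"])
[Score(precision=1.0, recall=1.0, fmeasure=1.0), Score(precision=1.0, recall=1.0, fmeasure=1.0)]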
For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for extracting confidence intervals [3, 4]. The idea is that given n samples, you draw x resamples with replacement, each of size n, and calculate some statistic for each resample. You then get a new distribution, called the empirical bootstrap distribution, which can be used to extract confidence intervals.
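As a minimal sketch of the idea in plain NumPy (not the library's actual code; the scores array and the 1000/2.5/97.5 constants are illustrative defaults):

import numpy as np

rng = np.random.default_rng(seed=0)

# per-example f-measures, e.g. one ROUGE-1 score per (summary, gold) pair
scores = np.array([0.42, 0.55, 0.61, 0.38, 0.50])
n = len(scores)

# draw 1000 resamples of size n with replacement; take the mean of each
resample_means = np.array([
    rng.choice(scores, size=n, replace=True).mean()
    for _ in range(1000)
])

# percentiles of the empirical bootstrap distribution give the interval
low, mid, high = np.percentile(resample_means, [2.5, 50, 97.5])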
In the ROUGE implementation by google [4], they used:

- 1000 for the number of resamples to run (the n_samples default)
- the mean as the resample statistic
- the 2.5th, 50th and 97.5th percentiles to calculate the values for low, mid and high, respectively (the interval width can be controlled with the confidence_interval param)
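If you want to run this aggregation directly, a rough usage sketch of the google rouge_score package looks like this (the inputs are made up; the scorer/aggregator parameters shown are the library's defaults):

from rouge_score import rouge_scorer, scoring

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
aggregator = scoring.BootstrapAggregator(confidence_interval=0.95,
                                         n_samples=1000)

# score each (reference, prediction) pair and feed it to the aggregator
for pred, ref in zip(predictions, references):
    aggregator.add_scores(scorer.score(ref, pred))

# aggregate() runs the bootstrap and returns low/mid/high per metric
result = aggregator.aggregate()
print(result["rouge1"].low, result["rouge1"].mid, result["rouge1"].high)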