The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].
When calculating any ROUGE metric, you get an aggregate result with three fields: low, mid, and high. How are these aggregate values calculated?
For example, from the huggingface implementation [2]:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0
Answer:
Given a list of (summary, gold_summary) pairs, each ROUGE metric is calculated for every item in the list. In the huggingface implementation, you can opt out of the aggregation step by passing use_aggregator=False, which returns these per-item values instead.
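For illustration (reusing the predictions/references from the question, where every pair matches exactly, so each per-example score is 1.0), this looks roughly like:

>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=False)
>>> print(results["rouge1"])
[Score(precision=1.0, recall=1.0, fmeasure=1.0), Score(precision=1.0, recall=1.0, fmeasure=1.0)]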
For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for extracting confidence intervals [3, 4]. The idea is that given n samples, you draw x resamples with replacement, each of size n, and calculate some statistic for each resample. You then get a new distribution, called the empirical bootstrap distribution, which can be used to extract confidence intervals.
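As a minimal sketch of the idea in plain NumPy (not the library's actual code; the scores array and the 1000/2.5/97.5 constants are illustrative defaults):

import numpy as np

rng = np.random.default_rng(seed=0)

# per-example f-measures, e.g. one ROUGE-1 score per (summary, gold) pair
scores = np.array([0.42, 0.55, 0.61, 0.38, 0.50])
n = len(scores)

# draw 1000 resamples of size n with replacement; take the mean of each
resample_means = np.array([
    rng.choice(scores, size=n, replace=True).mean()
    for _ in range(1000)
])

# percentiles of the empirical bootstrap distribution give the interval
low, mid, high = np.percentile(resample_means, [2.5, 50, 97.5])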
In the ROUGE implementation by google [4], they used:

- 1000 for the number of resamples to run (the n_samples default)
- the mean as the resample statistic
- the 2.5th, 50th and 97.5th percentiles to calculate the values for low, mid and high, respectively (the interval width can be controlled with the confidence_interval param)
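If you want to run this aggregation directly, a rough usage sketch of the google rouge_score package looks like this (the inputs are made up; the scorer/aggregator parameters shown are the library's defaults):

from rouge_score import rouge_scorer, scoring

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
aggregator = scoring.BootstrapAggregator(confidence_interval=0.95,
                                         n_samples=1000)

# score each (reference, prediction) pair and feed it to the aggregator
for pred, ref in zip(predictions, references):
    aggregator.add_scores(scorer.score(ref, pred))

# aggregate() runs the bootstrap and returns low/mid/high per metric
result = aggregator.aggregate()
print(result["rouge1"].low, result["rouge1"].mid, result["rouge1"].high)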