In the ROUGE metrics, what do the low, mid and high values mean?


The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].

When calculating any ROUGE metric, you get an aggregate result with three values: low, mid, and high. How are these aggregate values calculated?

For example, from the huggingface implementation [2]:

>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0

CodePudding user response:

Given a list of (summary, gold_summary) pairs, any ROUGE metric is calculated for each item in the list. In the huggingface implementation, you can opt out of the aggregation step by passing use_aggregator=False, which returns the per-item scores instead.
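To make the per-item part concrete, here is a minimal sketch of ROUGE-1 computed pair by pair. This is a simplified re-implementation for illustration (no stemming or proper tokenisation), not the huggingface code:

```python
from collections import Counter

def rouge1(prediction, reference):
    """Simplified ROUGE-1: unigram overlap between one prediction
    and one reference, scored as precision / recall / f-measure."""
    pred_tokens = Counter(prediction.split())
    ref_tokens = Counter(reference.split())
    # count of unigrams shared between prediction and reference
    overlap = sum((pred_tokens & ref_tokens).values())
    precision = overlap / max(sum(pred_tokens.values()), 1)
    recall = overlap / max(sum(ref_tokens.values()), 1)
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

# one score per (summary, gold_summary) pair -- conceptually what
# use_aggregator=False hands back, before any aggregation happens
scores = [rouge1(p, r) for p, r in zip(predictions, references)]
```

Only after these per-pair scores exist does the aggregation into low/mid/high come into play.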

For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for estimating confidence intervals [3, 4]. The idea: given n samples, you draw x resamples with replacement, each of size n, and calculate some statistic on each resample. The resulting distribution of statistics, called the empirical bootstrap distribution, can then be used to extract confidence intervals.
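The resampling procedure described above can be sketched in a few lines. This is a generic illustration (the statistic here is the mean, and the f-measure values are made up), not the library's code:

```python
import random

def bootstrap(samples, statistic, n_resamples=1000, seed=0):
    """Draw n_resamples resamples with replacement, each the same
    size as the original data, and evaluate `statistic` on each."""
    rng = random.Random(seed)
    n = len(samples)
    return [statistic(rng.choices(samples, k=n)) for _ in range(n_resamples)]

def mean(xs):
    return sum(xs) / len(xs)

# hypothetical per-item f-measures for five (summary, gold) pairs
fmeasures = [0.40, 0.55, 0.62, 0.48, 0.70]

# the empirical bootstrap distribution of the mean f-measure
boot_dist = bootstrap(fmeasures, mean)
```

Percentiles of `boot_dist` then give a confidence interval for the mean score.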

In the ROUGE implementation by Google [4], they use:

  • n for the number of resamples to run
  • the mean as the resample statistic
  • the 2.5th, 50th and 97.5th percentiles of the bootstrap distribution for low, mid and high, respectively (the interval width can be controlled with the confidence_interval parameter)