Querying duplicate rows within Oracle Database-CodePudding

I have a table that looks like this.

|  File_ID  |   MD5 Sum   |  File Size  |
| --------- | ----------- | ----------- |
|   140532  |  10000000   |    3000     |
|   192348  |  11111111   |    4000     |
|   223292  |  22222222   |    4000     |
|   272364  |  11111111   |    4000     |
|   223045  |  10000000   |    3000     |

I'd like to see how much space is wasted by duplicate files. The problem is that these duplicate files have unique primary keys (file_id). We know we have duplicates because the count(distinct MD5 sum) != count(*)

I'd like to write a query that returns the total space being used by duplicate files. In this example, the query would return the 7000, because rows with file id 272364 & 223045 are duplicitous.

If anyone can help me with this, it would be much appreciated.

CodePudding user response：

You can produce a row number using the MD5 and then any duplicate will show up with row number above 1.

For example:

select sum(file_size)
from (
  select t.*, row_number() over(partition by md5_sum order by file_id) as rn
  from t
) x
where rn > 1

CodePudding user response：

An alternative to The Impaler's suggestion. But I admit I like their approach better :-)

Group by MD5 sum and look at those that have more than one entry. Then subtract one file size from the sum of filesizes to get the excess. At last add up all those file excess sums.

select sum(excess) as total
from
(
  select md5, sum(filesize) - min(filesize) as excess
  from mytable
  group by md5
  having count(*) > 1
) excess_per_file;