I have the following table
ID VALUE1 VALUE2
1 aaa bbb
1 aaa bbb
Sadly, my table has no primmary keys and I want to create it. But first I need to delete only the first occurrance of duplicated rows. I don't have any unique identifier, the rows are exactly equal.
I know that I can use something like this if I have an unique identifier:
DELETE FROM
table
WHERE
table.id < table.id
But, what is the best approach to this if I don't have a unique identifier?
My table has 500 million lines, about 100k are duplicated.
CodePudding user response:
If I understand the question correctly, you can proceed using Row_Number():
with dataset as (select 1 as ID, 'aaa' as VALUE1, 'bbb' as VALUE2
union all select 1, 'aaa', 'bbb'
union all select 2, 'ccc', 'ddd'
union all select 2, 'ccc', 'ddd')
select *, row_number() over (partition by id, value1, value2) from dataset;
This will create the following column row_number:
id | value1 | value2 | row_number
---- -------- -------- ------------
1 | aaa | bbb | 1
1 | aaa | bbb | 2
2 | ccc | ddd | 1
2 | ccc | ddd | 2
and then you can delete * where row_number = 1;