I'm having hard time removing duplicates from database. It's MariaDB (protocol version: 10, 10.3.34-MariaDB Server). I need to remove rows where three columns are equal. I was trying to use WITH clause but database throws error that it can't recognize 'WITH', so I focused on traditional way.
I need to remove rows where foreignId
, column1
and column2
are equal.
I'm checking if there are duplicates like
SELECT foreignId, column1, column2, COUNT(*)
FROM table1
GROUP BY foreignId, column1, column2
HAVING COUNT(*) > 1
Trying to remove duplicates...
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
p.foreignId,
p.column1,
p.column2,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.foreignId,
p.column2,
p.column1
) AS row_number
FROM table1 p
GROUP BY p.foreignId, p.column1, p.column2
) dup
ON table1.column1 = dup.column1
WHERE dup.row_number > 1;
I was modifying this code alot but still can't make it work as intended... What am I doing wrong?
CodePudding user response:
You have a few issues with your query:
- You need to remove the
GROUP BY
in the subquery - You should change the
ORDER BY
in theOVER
clause toORDER BY p.ts DESC
(wherets
is the name of your timestamp column) - You need to
JOIN
on the uniqueid
column; otherwise you will delete any row which has values which have duplicates anywhere i.e.ON table.id = dup.id
That will give you:
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.ttimestamp DESC
) AS rn
FROM table1 p
) dup
ON table1.id = dup.id
WHERE dup.rn > 1
Note I would not use row_number
as a column alias as it is a reserved word, so I've changed it to rn
above.
Demo (thanks to @JonasMetzler) on dbfiddle
Note that if it's possible for duplicate rows to also have the same timestamp value, this query will delete a random selection of those rows. If you want a deterministic result, change the ORDER BY
clause to
ORDER BY
p.ttimestamp DESC,
p.id DESC
which will keep the row with the highest (or lowest if you remove the DESC
after p.id
) id
value.
CodePudding user response:
Assuming you have a unique column like id, you can do following:
DELETE FROM table1 WHERE ID NOT IN
(SELECT x.id FROM
(SELECT MAX(id) id, MAX(foreignId) foreignId,
MAX(column1) column1, MAX(column2) column2
FROM table1
WHERE ttimestamp IN (SELECT MAX(ttimestamp) FROM table1
GROUP BY foreignID, column1, column2)
GROUP BY foreignId, column1, column2)x);
Please see the working example here: db<>fiddle