Can't remove duplicates from MariaDB 10-CodePudding

I'm having hard time removing duplicates from database. It's MariaDB (protocol version: 10, 10.3.34-MariaDB Server). I need to remove rows where three columns are equal. I was trying to use WITH clause but database throws error that it can't recognize 'WITH', so I focused on traditional way.

I need to remove rows where foreignId, column1 and column2 are equal.

I'm checking if there are duplicates like

SELECT foreignId, column1, column2, COUNT(*)
FROM table1
GROUP BY foreignId, column1, column2
HAVING COUNT(*) > 1

Trying to remove duplicates...

DELETE table1
FROM table1
INNER JOIN (
  SELECT
    p.id,
    p.foreignId,
    p.column1,
    p.column2,
    ROW_NUMBER() OVER (
      PARTITION BY 
        p.column1, 
        p.column2, 
        p.foreignId
      ORDER BY
        p.foreignId,
        p.column2,
        p.column1
    ) AS row_number
  FROM table1 p
  GROUP BY p.foreignId, p.column1, p.column2
) dup
ON table1.column1 = dup.column1
WHERE dup.row_number > 1;

I was modifying this code alot but still can't make it work as intended... What am I doing wrong?

CodePudding user response：

You have a few issues with your query:

You need to remove the GROUP BY in the subquery
You should change the ORDER BY in the OVER clause to ORDER BY p.ts DESC (where ts is the name of your timestamp column)
You need to JOIN on the unique id column; otherwise you will delete any row which has values which have duplicates anywhere i.e. ON table.id = dup.id

That will give you:

DELETE table1
FROM table1
INNER JOIN (
  SELECT
    p.id,
    ROW_NUMBER() OVER (
      PARTITION BY 
        p.column1, 
        p.column2, 
        p.foreignId
      ORDER BY
        p.ttimestamp DESC
    ) AS rn
  FROM table1 p
) dup
ON table1.id = dup.id
WHERE dup.rn > 1

Note I would not use row_number as a column alias as it is a reserved word, so I've changed it to rn above.

Demo (thanks to @JonasMetzler) on dbfiddle

Note that if it's possible for duplicate rows to also have the same timestamp value, this query will delete a random selection of those rows. If you want a deterministic result, change the ORDER BY clause to

      ORDER BY
        p.ttimestamp DESC,
        p.id DESC

which will keep the row with the highest (or lowest if you remove the DESC after p.id) id value.

Demo on dbfiddle

CodePudding user response：

Assuming you have a unique column like id, you can do following:

DELETE FROM table1 WHERE ID NOT IN 
(SELECT x.id FROM
(SELECT MAX(id) id, MAX(foreignId) foreignId, 
MAX(column1) column1, MAX(column2) column2
FROM table1 
WHERE ttimestamp IN (SELECT MAX(ttimestamp) FROM table1 
GROUP BY foreignID, column1, column2)
GROUP BY foreignId, column1, column2)x);

Please see the working example here: db<>fiddle