Keep only first row given a set of columns in PostgreSQL-CodePudding

I have a table that looks like this:

id	creation_date	col1	col2	col3
1	21 April	a	b	e
2	23 April	a	b	e
3	21 April	c	d	e

My goal is to delete all the duplicates, keeping only the first (ie first created) row.

The duplicates are defined using a set of columns from the table. For example, using col1 and col2, the result will be :

id	creation_date	col1	col2	col3
1	21 April	a	b	e
3	21 April	c	d	e

One answer that I found use one column as a key :

DELETE FROM dups a USING (
      SELECT MIN(ctid) as ctid, key
        FROM dups 
        GROUP BY key HAVING COUNT(*) > 1
      ) b
      WHERE a.key = b.key 
      AND a.ctid <> b.ctid

, but I can't see how to adapt to the case where multiple columns are used, and where the earliest record should be kept.

CodePudding user response：

You can try to use DELETE and EXISTS with your logic in subquery condition.

DELETE 
FROM dups t
WHERE EXISTS (SELECT 1 
               FROM dups tt
               WHERE tt.col1 =  t.col1
               AND tt.col2 =  t.col2
               AND tt.col3 =  t.col3
               HAVING MIN(tt.ctid) <> t.ctid
               );

sqlfiddle

CodePudding user response：

You can use an implicit self join of the table:

DELETE FROM dups d1
USING dups d2
WHERE (d2.col1, d2.col2, d2.col3) = (d1.col1, d1.col2, d1.col3)
  AND d2.creation_date < d1.creation_date;

See the demo.