Optimisation of sql query for deleting duplicate items from large table-CodePudding

Could anyone please help me optimise one of the queries which is taking more than 20 minutes to run against 3 Million data.

Table Structure

-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------

Query

The purpose of the query is to eliminate the duplicate, here duplicate means having same name_id and name.

DELETE
FROM records
WHERE id NOT IN
      (SELECT DISTINCT
          ON (name_id, name) id
       FROM records);

CodePudding user response：

I would write your delete using exists logic:

DELETE
FROM records r1
WHERE EXISTS (SELECT 1 FROM records r2
              WHERE r2.name_id = r1.name_id AND r2.name = r2.name AND
                    r2.id < r1.id);

This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:

CREATE INDEX idx ON records (name_id, name, id);

CodePudding user response：

You probably already have a primary key on the identity column, then you can use it to exclude redundant rows by id in the following way:

WITH cte AS (
SELECT MIN(id) AS id FROM records GROUP BY name_id, name)   
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id=records.id)

Even without the index, this should work relatively fast, probably because of merge join strategy.