Home > Mobile >  Speed up removal of duplicates in Oracle with indexing
Speed up removal of duplicates in Oracle with indexing

Time:11-22

How to remove duplicate entries from a large Oracle table (200M rows, 20 columns)?

The below query from 2014 is slow. It took 2 minutes to delete 1 duplicate entry for one specific combination of columns (i.e. where col1 = 1 and .. col20 = 'Z').

DELETE sch.table1
 WHERE rowid NOT IN
       (SELECT MIN(rowid)
          FROM sch.table1
         GROUP BY col1, col2, col3, col4,.. ., col20)

Any way to speed it up, e.g. with indexing?

CodePudding user response:

Rather than using an anti-join (and finding the non-matching ROWID and then deleting all the others), you can use the ROW_NUMBER analytic function to directly find the ROWIDs to delete:

DELETE FROM sch.table1 t
WHERE EXISTS(
  SELECT 1
  FROM   (
    SELECT ROW_NUMBER() OVER (
             PARTITION BY col1, col2, col3, col4, ..., col20
             ORDER BY rowid
           ) AS rn
    FROM   sch.table1
  ) x
  WHERE x.ROWID = t.ROWID
  AND   x.rn > 1
);

or:

DELETE FROM sch.table1 t
WHERE ROWID IN (
  SELECT ROWID
  FROM   (
    SELECT ROW_NUMBER() OVER (
             PARTITION BY col1, col2, col3, col4, ..., col20
             ORDER BY rowid
           ) AS rn
    FROM   sch.table1
  )
  WHERE rn > 1
);

fiddle

CodePudding user response:

As a person who has spent 20 years as a professional in data warehousing doing this kind of operation, the best answer I can give you is that investing time in explain plan will return enormous time savings in the long run. The link above is just the syntax of the command. Interpreting the execution plan as detailed Oracle Database Performance Tuning Guide is difficult at first but will be worth the time invested.

In this case, I can tell you that "not in" queries are rarely efficiently optimized by the database engine, but you don't have to believe me, just verify it from the explain plan. The reason is that the execution engine will have to save the entire results of the subquery, all 200 million rows. Even worse, unless Oracle has advanced light years since I last used it, it does not index intermediate tables, so every row that is checked for "not in" is a full scan of the intermediate set. So it is possibly checking 200 million x 200 million comparisons (there may be some partitioning tricks that Oracle uses to reduce that a bit). That's a pretty capable database which can do that in just a few minutes.

So knowing that, you know what to do. Find a subquery that locates just the one row to delete instead of using one that gives you every row that you dont want to delete. @MTO answers along that line. Personally I try to avoid "where exists" for similar reasons, but databases these days might well do a decent job with them.

As a refinement I would make it a two step process and create a separate table having the rows which are found to be candidates for removal, and delete from the base table the matching rows. This way you have a record of the rows deleted in case somebody asks you the next day and you can run sanity checks (like counts) before running the actual deletes, which might some day prevent accidents.

  • Related