I am running a query to delete duplicates (as I define them) from a table that has ~3M records. The query is:
DELETE t1
FROM `path_alias_revision` t1
INNER JOIN `path_alias_revision` t2
WHERE t1.id < t2.id AND t1.path = t2.path AND binary(t1.alias) = binary(t2.alias);
Output of SHOW CREATE TABLE:

CREATE TABLE `path_alias_revision` (
  `id` int(10) unsigned NOT NULL,
  `revision_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `langcode` varchar(12) CHARACTER SET ascii NOT NULL,
  `path` varchar(255) DEFAULT NULL,
  `alias` varchar(255) DEFAULT NULL,
  `status` tinyint(4) NOT NULL,
  `revision_default` tinyint(4) DEFAULT NULL,
  PRIMARY KEY (`revision_id`),
  KEY `path_alias__id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=80213807 DEFAULT CHARSET=utf8mb4 COMMENT='The revision table for path_alias entities.'
The explain output:
explain DELETE t1 FROM path_alias_revision t1 INNER JOIN path_alias_revision t2 WHERE t1.id < t2.id AND t1.path=t2.path AND binary(t1.alias) = binary(t2.alias);
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key  | key_len | ref  | rows    | filtered | Extra                                          |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
|  1 | DELETE      | t1    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t2    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |     3.33 | Range checked for each record (index map: 0x2) |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
I cannot tell if the query is hanging or just taking a long time. The output of SHOW PROCESSLIST is:
MySQL [acquia]> show processlist \G;
*************************** 1. row ***************************
Id: 11
User: acquia
Host: 172.18.0.3:37498
db: acquia
Command: Query
Time: 602
State: Sending data
Info: DELETE t1
FROM path_alias_revision t1
INNER JOIN path_alias_revision t2
WHERE t1.id < t2.
*************************** 2. row ***************************
Id: 15
User: acquia
Host: 172.18.0.3:37512
db: acquia
Command: Query
Time: 0
State: starting
Info: show processlist
2 rows in set (0.000 sec)
ERROR: No query specified
What can I do to improve this query? I know I could move the data I want to retain to a temp table and rename it, but I want to understand what's going on here. I have increased some MySQL settings:
max_allowed_packet = 128M
innodb_buffer_pool_chunk_size = 128M
innodb_buffer_pool_size = 8G
But it's not helping.
CodePudding user response:
Update: The question was edited after I wrote my answer. The OP added a condition that they don't want to create an index. But that's the solution to optimizing the DELETE query. I will leave my original answer below.
You're missing an index. Also, if you want to optimize binary comparisons on the alias column, you should change its collation so the index compares binary bytes.
mysql> alter table path_alias_revision
modify column alias varchar(255) collate utf8mb4_bin,
add index (path, alias);
Now you can see the improvement in the EXPLAIN. Naturally the query still has to do a table scan for t1, but it can use the index to find matching rows in t2. The key_len and ref values show that it's able to use the index for both columns.
explain DELETE t1 FROM path_alias_revision t1
INNER JOIN path_alias_revision t2
WHERE t1.id < t2.id AND t1.path=t2.path AND t1.alias = t2.alias
*************************** 1. row ***************************
id: 1
select_type: DELETE
table: t1
partitions: NULL
type: ALL
possible_keys: path_alias__id,path
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
partitions: NULL
type: ref
possible_keys: path_alias__id,path
key: path
key_len: 1791
ref: test2.t1.path,test2.t1.alias
rows: 1
filtered: 100.00
Extra: Using where
In my test, it shows rows: 1 because I didn't create test data.
CodePudding user response:
You didn't mention how many duplicate rows are slated for deletion, so this is a guess. InnoDB (the storage engine) puts your entire DELETE statement into a single transaction. It builds up a transaction log and commits the whole thing at once (for ACID's sake). That transaction might be quite enormous, and it will consume resources (I/O and CPU).
The trick to avoiding the oversized transaction? Do the delete in chunks.
Try this: First grab the PKs of the rows you need to delete.
/* make a temp table with the PK values for the rows you want to delete
* this may take a lot of time but that's OK */
CREATE TEMPORARY TABLE path_revision_alias_dups
SELECT t1.revision_id
FROM `path_alias_revision` t1
INNER JOIN `path_alias_revision` t2
WHERE t1.id < t2.id AND t1.path=t2.path AND binary(t1.alias) = binary(t2.alias);
Now we have a temp table with the rows slated for deletion. Let's use it.
You're going to do your deletion in chunks (here in chunks of 1000 rows). So, you'll need to repeat this next hunk of SQL over and over until there's nothing left to delete.
/* retrieve a subset -- a chunk -- of the IDs to delete, 1000 at a time */
CREATE TEMPORARY TABLE dups_to_delete_now
SELECT revision_id
FROM path_revision_alias_dups
LIMIT 1000;
/* delete the rows from your table */
DELETE FROM path_alias_revision
WHERE revision_id IN (SELECT * FROM dups_to_delete_now);
/* and delete the batch from your first temp table */
DELETE FROM path_revision_alias_dups
WHERE revision_id IN (SELECT * FROM dups_to_delete_now);
/* clean up, ready for the next chunk */
DROP TABLE dups_to_delete_now;
This is a fairly common query pattern to use when doing large-scale table maintenance.
The first CREATE TEMPORARY TABLE I suggested may take too long if you can't add any indexes: without them, the self-join's complexity is O(n²). If it does take too long, you'll need indexes somewhere.
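If repeating that chunk of SQL by hand is tedious, the loop can be wrapped in a stored procedure. This is only a sketch: the procedure name is made up, it assumes the path_revision_alias_dups temp table from above exists in the same session, and it relies on autocommit being on so each chunk commits as its own small transaction.

```sql
-- Sketch of automating the chunked delete; names are illustrative.
-- Assumes path_revision_alias_dups was created earlier in this session
-- and that autocommit is on, so every statement commits on its own.
DELIMITER $$
CREATE PROCEDURE purge_dup_revisions()
BEGIN
  REPEAT
    -- delete one chunk of up to 1000 rows from the real table
    DELETE pa
    FROM path_alias_revision pa
    JOIN (SELECT revision_id
          FROM path_revision_alias_dups
          ORDER BY revision_id
          LIMIT 1000) chunk ON pa.revision_id = chunk.revision_id;
    -- remove the same chunk from the to-do list
    DELETE FROM path_revision_alias_dups
    ORDER BY revision_id
    LIMIT 1000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END$$
DELIMITER ;

CALL purge_dup_revisions();
```

The ORDER BY in both statements keeps the two LIMIT clauses pointing at the same 1000 rows.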
CodePudding user response:
The original attempt, involving t1.id < t2.id, is terribly inefficient; it takes on the order of a trillion operations on a million-row table. (I wish the reference manual did not include that pattern.)
There are many ways to do a DELETE more efficiently. And there is a way to add an INDEX without locking the table (if that is your real fear). Also note that WHERE binary(alias) ... would not use INDEX(alias)!
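On the locking point: InnoDB's online DDL (MySQL 5.6 and later) can usually build a secondary index while reads and writes continue. A sketch (the index name here is made up):

```sql
-- Build the composite index online; with these clauses the statement
-- fails outright, rather than silently falling back to a locking build,
-- if INPLACE / LOCK=NONE cannot be honored.
ALTER TABLE path_alias_revision
  ADD INDEX path_alias__path_alias (`path`, `alias`),
  ALGORITHM=INPLACE, LOCK=NONE;
```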
So, what to do?

- If you are deleting much of the table, don't use DELETE; instead, SELECT the rows to keep into a new table, then juggle tables. That applies to virtually any big delete that removes "much" of the table. (What is "much"? I don't have any good numbers; maybe 1/3, maybe 1/2, certainly 3/4.)
- Do NOT use the t1.id < t2.id technique; it is probably the worst ever invented for large tables, even if indexed.
- If you can't get rid of the binary(alias), then start by creating a helper table holding binary(alias) and the primary key of your table:

CREATE TABLE helper (
    balias ... NOT NULL,   -- from binary(alias)
    id ... NOT NULL,       -- whatever is the PK of your table
    PRIMARY KEY (balias)
) ENGINE=InnoDB;
Now populate it:
INSERT INTO helper ( balias, id )
SELECT binary(alias), id FROM path_alias_revision;
Then either:

- use that table to find what to DELETE (if only a small number of rows must go), via a multi-table DELETE; or
- SELECT the rows to "keep" (if a large number must go) using a LEFT JOIN, then swap tables.
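Putting the "SELECT the rows to keep, then swap" idea into concrete form: the sketch below assumes "keep" means the row(s) with the greatest id in each (path, binary alias) group, which matches the semantics of the original DELETE. Without an index this SELECT is still O(n²), so it pairs with the helper-table or index approaches above.

```sql
-- Sketch: copy the survivors, then swap tables atomically.
CREATE TABLE path_alias_revision_new LIKE path_alias_revision;

-- Keep a row when no other row has the same path/alias and a larger id.
INSERT INTO path_alias_revision_new
SELECT t1.*
FROM path_alias_revision t1
LEFT JOIN path_alias_revision t2
  ON  t1.path = t2.path
  AND binary(t1.alias) = binary(t2.alias)
  AND t1.id < t2.id
WHERE t2.id IS NULL;

-- RENAME TABLE swaps both names in one atomic step,
-- so readers never see a missing table.
RENAME TABLE path_alias_revision     TO path_alias_revision_old,
             path_alias_revision_new TO path_alias_revision;

-- DROP TABLE path_alias_revision_old;  -- once the result is verified
```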