I am running a query to delete duplicates (as I define them) from a table that has ~3M records. The query is:
DELETE t1
FROM `path_alias_revision` t1
INNER JOIN `path_alias_revision` t2
WHERE t1.id < t2.id AND t1.path = t2.path AND binary(t1.alias) = binary(t2.alias);
Output of SHOW CREATE TABLE:

CREATE TABLE `path_alias_revision` (
  `id` int(10) unsigned NOT NULL,
  `revision_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `langcode` varchar(12) CHARACTER SET ascii NOT NULL,
  `path` varchar(255) DEFAULT NULL,
  `alias` varchar(255) DEFAULT NULL,
  `status` tinyint(4) NOT NULL,
  `revision_default` tinyint(4) DEFAULT NULL,
  PRIMARY KEY (`revision_id`),
  KEY `path_alias__id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=80213807 DEFAULT CHARSET=utf8mb4 COMMENT='The revision table for path_alias entities.'
The explain output:
explain DELETE t1 FROM path_alias_revision t1 INNER JOIN path_alias_revision t2 WHERE t1.id < t2.id AND t1.path=t2.path AND binary(t1.alias) = binary(t2.alias);
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key  | key_len | ref  | rows    | filtered | Extra                                          |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
|  1 | DELETE      | t1    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t2    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |     3.33 | Range checked for each record (index map: 0x2) |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
I cannot tell if the query is hanging or just taking a long time. The output of SHOW PROCESSLIST is:
MySQL [acquia]> show processlist \G;
*************************** 1. row ***************************
Id: 11
User: acquia
Host: 172.18.0.3:37498
db: acquia
Command: Query
Time: 602
State: Sending data
Info: DELETE t1
FROM path_alias_revision t1
INNER JOIN path_alias_revision t2
WHERE t1.id < t2.
*************************** 2. row ***************************
Id: 15
User: acquia
Host: 172.18.0.3:37512
db: acquia
Command: Query
Time: 0
State: starting
Info: show processlist
2 rows in set (0.000 sec)
ERROR: No query specified
What can I do to improve this query? I know I could move the data I want to retain to a temp table and rename it, but I want to understand what's going on here. I have increased some MySQL settings:
max_allowed_packet = 128M
innodb_buffer_pool_chunk_size = 128M
innodb_buffer_pool_size = 8G
But it's not helping.
CodePudding user response:
Update: The question was edited after I wrote my answer. The OP added a condition that they don't want to create an index. But that's the solution to optimizing the DELETE query. I will leave my original answer below.
You're missing an index. Also, if you want to optimize binary comparisons on the alias column, you should change its collation so the index compares binary bytes.
mysql> alter table path_alias_revision
modify column alias varchar(255) collate utf8mb4_bin,
add index (path, alias);
Now you can see the improvement in the EXPLAIN. Naturally the query still has to do a table scan for t1, but it can use the index to find matching rows in t2. The key_len and ref values show that it's able to use the index for both columns.
explain DELETE t1 FROM path_alias_revision t1
INNER JOIN path_alias_revision t2
WHERE t1.id < t2.id AND t1.path=t2.path AND t1.alias = t2.alias
*************************** 1. row ***************************
id: 1
select_type: DELETE
table: t1
partitions: NULL
type: ALL
possible_keys: path_alias__id,path
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
partitions: NULL
type: ref
possible_keys: path_alias__id,path
key: path
key_len: 1791
ref: test2.t1.path,test2.t1.alias
rows: 1
filtered: 100.00
Extra: Using where
In my test, it shows rows: 1 because I didn't create test data.
CodePudding user response:
You didn't mention how many duplicate rows are slated for deletion, so this is a guess. InnoDB (the storage engine) puts your entire DELETE statement into a single transaction. It builds up a transaction log and commits the whole thing at once (for ACID's sake). That transaction might be quite enormous, and it will consume resources (I/O and CPU).
The trick to avoiding the oversized transaction? Do the delete in chunks.
Try this: First grab the PKs of the rows you need to delete.
/* make a temp table with the PK values for the rows you want to delete
* this may take a lot of time but that's OK */
CREATE TEMPORARY TABLE path_revision_alias_dups
SELECT t1.revision_id
FROM `path_alias_revision` t1
INNER JOIN `path_alias_revision` t2
WHERE t1.id < t2.id AND t1.path=t2.path AND binary(t1.alias) = binary(t2.alias);
Now we have a temp table with the rows slated for deletion. Let's use it.
You're going to do your deletion in chunks (here in chunks of 1000 rows). So, you'll need to repeat this next hunk of SQL over and over until there's nothing left to delete.
/* retrieve a subset -- a chunk -- of the IDs to delete, 1000 at a time */
CREATE TEMPORARY TABLE dups_to_delete_now
SELECT revision_id
FROM path_revision_alias_dups
LIMIT 1000;
/* delete the rows from your table */
DELETE FROM path_alias_revision
WHERE revision_id IN (SELECT * FROM dups_to_delete_now);
/* and delete the batch from your first temp table */
DELETE FROM path_revision_alias_dups
WHERE revision_id IN (SELECT * FROM dups_to_delete_now);
/* clean up, ready for the next chunk */
DROP TABLE dups_to_delete_now;
This is a fairly common query pattern to use when doing large-scale table maintenance.
The first CREATE TEMPORARY TABLE I suggested may take too long if you can't add any indexes: without them, the self-join's complexity is O(n²). If it does take too long, you'll need indexes somewhere.
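If repeating that chunk of SQL by hand is tedious, the loop can be wrapped in a stored procedure. This is only a sketch: the procedure name is made up, it assumes the path_revision_alias_dups temp table from above exists in the same session, and it relies on autocommit being on so each chunk commits as its own small transaction.

```sql
-- Sketch of automating the chunked delete; names are illustrative.
-- Assumes path_revision_alias_dups was created earlier in this session
-- and that autocommit is on, so every statement commits on its own.
DELIMITER $$
CREATE PROCEDURE purge_dup_revisions()
BEGIN
  REPEAT
    -- delete one chunk of up to 1000 rows from the real table
    DELETE pa
    FROM path_alias_revision pa
    JOIN (SELECT revision_id
          FROM path_revision_alias_dups
          ORDER BY revision_id
          LIMIT 1000) chunk ON pa.revision_id = chunk.revision_id;
    -- remove the same chunk from the to-do list
    DELETE FROM path_revision_alias_dups
    ORDER BY revision_id
    LIMIT 1000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END$$
DELIMITER ;

CALL purge_dup_revisions();
```

The ORDER BY in both statements keeps the two LIMIT clauses pointing at the same 1000 rows.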
CodePudding user response:
The original attempt, involving t1.id < t2.id, is terribly inefficient; it takes on the order of a trillion operations on a million-row table. (I wish the reference manual did not include that pattern.)
There are many ways to do a DELETE more efficiently. And there is a way to add an INDEX without locking the table (if that is your real fear). Also note that WHERE binary(alias) ... would not use INDEX(alias)!
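On the locking point: InnoDB's online DDL (MySQL 5.6 and later) can usually build a secondary index while reads and writes continue. A sketch (the index name here is made up):

```sql
-- Build the composite index online; with these clauses the statement
-- fails outright, rather than silently falling back to a locking build,
-- if INPLACE / LOCK=NONE cannot be honored.
ALTER TABLE path_alias_revision
  ADD INDEX path_alias__path_alias (`path`, `alias`),
  ALGORITHM=INPLACE, LOCK=NONE;
```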
So, what to do?

- If you are deleting much of the table, don't use DELETE; instead, SELECT the rows to keep into a new table, then juggle tables. That applies to virtually any big delete that removes "much" of the table. (What is "much"? I don't have any good numbers; maybe 1/3, maybe 1/2, certainly 3/4.)
- Do NOT use the t1.id < t2.id technique; it is probably the worst ever invented for large tables, even if indexed.
- If you can't get rid of the binary(alias), then start by creating a helper table holding binary(alias) and the primary key of your table:

CREATE TABLE helper (
    balias ... NOT NULL,   -- from binary(alias)
    id ... NOT NULL,       -- whatever is the PK of your table
    PRIMARY KEY (balias)
) ENGINE=InnoDB;
Now populate it:
INSERT INTO helper ( balias, id )
SELECT binary(alias), id FROM path_alias_revision;
Then either:

- use that table to find what to DELETE (if only a small number of rows must go), via a multi-table DELETE; or
- SELECT the rows to "keep" (if a large number must go) using a LEFT JOIN, then swap tables.
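Putting the "SELECT the rows to keep, then swap" idea into concrete form: the sketch below assumes "keep" means the row(s) with the greatest id in each (path, binary alias) group, which matches the semantics of the original DELETE. Without an index this SELECT is still O(n²), so it pairs with the helper-table or index approaches above.

```sql
-- Sketch: copy the survivors, then swap tables atomically.
CREATE TABLE path_alias_revision_new LIKE path_alias_revision;

-- Keep a row when no other row has the same path/alias and a larger id.
INSERT INTO path_alias_revision_new
SELECT t1.*
FROM path_alias_revision t1
LEFT JOIN path_alias_revision t2
  ON  t1.path = t2.path
  AND binary(t1.alias) = binary(t2.alias)
  AND t1.id < t2.id
WHERE t2.id IS NULL;

-- RENAME TABLE swaps both names in one atomic step,
-- so readers never see a missing table.
RENAME TABLE path_alias_revision     TO path_alias_revision_old,
             path_alias_revision_new TO path_alias_revision;

-- DROP TABLE path_alias_revision_old;  -- once the result is verified
```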