I have a query that left joins one table to a subquery, and although the subquery returns an empty set due to a where condition, the overall query still takes significantly longer with the join than without it, even if I consider the time it takes for the subquery to run. Any ideas why this is happening? Details below
Query:
select Address.*
from Address
left join (
select lotNumber, max(jobId) as id
from Address
where jobId is not null
group by lotNumber
) latestJob on latestJob.lotNumber = Address.lotNumber
Schema:
CREATE TABLE `Address` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`streetNumber` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`street` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`lotNumber` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`jobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_Address_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=1032717 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Explain:
---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- -------------
| 1 | PRIMARY | Address | NULL | ALL | NULL | NULL | NULL | NULL | 1027850 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 183 | Address.lotNumber | 10 | 100.00 | NULL |
| 2 | DERIVED | Address | NULL | index | idx_Address_lotNumber | idx_Address_lotNumber | 183 | NULL | 1027850 | 90.00 | Using where |
---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- -------------
Currently the Address
table has about 1M records, but jobId
is null for all of them so the left joined subquery returns an empty set.
The subquery takes ~0.07 second to run by itself, however the entire query takes ~2.22 seconds. The query without the subquery takes ~0.07 second. It seems that when joining to an empty set, the entire query should just take ~0.07 ~0.07 = ~0.14 second, so where are the extra 2 seconds coming from? It seems like something inefficient is happening in the join operation. Anyway to improve this?
Thanks
CodePudding user response:
In a short answer: add an index for your jobId
column.I have a similar table named proctable
, which has 1.4 m rows generated through a procedure. While the table name and column names differ from yours, the structure is alike( with the unnecessary columns omitted):
CREATE TABLE `proctable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`num` int(11) DEFAULT NULL,
`jid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx` (`num`)
) ENGINE=InnoDB AUTO_INCREMENT=1429159 DEFAULT CHARSET=latin1
The id
has values range from 1 to 1429158, while the num
is randomly generated using function rand()
which is from 1 to 10000. The jid
has all rows set to null. The original query (without indexing column jid
)executes like this:
select *
from proctable t1
left join (select num, max(jid) as id
from proctable
where jid is not null
group by num
) t2 on t1.num = t2.num;
-- 1429158 rows in set (7.08 sec)
explain select *
from proctable t1
left join (select num, max(jid) as id
from proctable
where jid is not null
group by num
) t2 on t1.num = t2.num;
-- here is the execution plan :
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------------ ------------ ------- --------------- ------------- --------- --------------- --------- ---------- -------------
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 1426920 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ref | <auto_key0> | <auto_key0> | 5 | testdb.t1.num | 10 | 100.00 | NULL |
| 2 | DERIVED | proctable | NULL | index | idx | idx | 5 | NULL | 1426920 | 90.00 | Using where |
Then we index the jid
column and do it again.
create index jidx on proctable(jid);
-- the execution plan has changed:
explain select * from proctable t1 left join (select num, max(jid) as id from proctable where jid is not null group by num ) t2 on t1.num = t2.num;
---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- --------------------------------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- --------------------------------------------------------
| 1 | PRIMARY | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 1426920 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | proctable | NULL | range | idx,jidx | jidx | 5 | NULL | 1 | 100.00 | Using index condition; Using temporary; Using filesort |
By using jidx as the key, the DERIVED one using range as the type instead of index has drastically reduced the rows anticipated from 1 million to 1. Let's execute the query.
select *
from proctable t1
left join (select num, max(jid) as id
from proctable
where jid is not null
group by num
) t2 on t1.num = t2.num;
-- 1429158 rows in set (3.73 sec)
Note: This is done on my lab VM which has a measely 4G physical ram. By adding an index for jid
, it nearly reduced the response time by half. Additionally, if you change the outter join to inner join or straight join, the discrepancy is huge.
select * from proctable t1 join (select num, max(jid) as id from proctable where jid is not null group by num ) t2 on t1.num = t2.num;
-- Empty set (0.00 sec)
explain select * from proctable t1 join (select num, max(jid) as id from proctable where jid is not null group by num ) t2 on t1.num = t2.num;
---- ------------- ------------ ------------ ------- --------------- ------ --------- -------- ------ ---------- --------------------------------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------------ ------------ ------- --------------- ------ --------- -------- ------ ---------- --------------------------------------------------------
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | Using where |
| 1 | PRIMARY | t1 | NULL | ref | idx | idx | 5 | t2.num | 146 | 100.00 | NULL |
| 2 | DERIVED | proctable | NULL | range | jidx,idx | jidx | 5 | NULL | 1 | 100.00 | Using index condition; Using temporary; Using filesort |
explain select * from proctable t1 straight_join (select num, max(jid) as id from proctable where jid is not null group by num ) t2 on t1.num = t2.num;
---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- --------------------------------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- --------------------------------------------------------
| 1 | PRIMARY | t1 | NULL | ALL | idx | NULL | NULL | NULL | 1426920 | 100.00 | NULL |
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 50.00 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | proctable | NULL | range | jidx,idx | jidx | 5 | NULL | 1 | 100.00 | Using index condition; Using temporary; Using filesort |
select * from proctable t1 straight_join (select num, max(jid) as id from proctable where jid is not null group by num ) t2 on t1.num = t2.num;
-- Empty set (2.91 sec)