Why does a MySQL 5.7 left join to an empty set cause the query to perform worse?-CodePudding

I have a query that left joins one table to a subquery, and although the subquery returns an empty set due to a where condition, the overall query still takes significantly longer with the join than without it, even if I consider the time it takes for the subquery to run. Any ideas why this is happening? Details below

Query:

select Address.*
from Address
left join (
    select lotNumber, max(jobId) as id
    from Address
    where jobId is not null
    group by lotNumber
) latestJob on latestJob.lotNumber = Address.lotNumber

Schema:

CREATE TABLE `Address` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `streetNumber` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `street` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `lotNumber` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `jobId` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_Address_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=1032717 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Explain:

 ---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- ------------- 
| id | select_type | table           | partitions | type  | possible_keys                 | key                           | key_len | ref                       | rows    | filtered | Extra       |
 ---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- ------------- 
|  1 | PRIMARY     | Address         | NULL       | ALL   | NULL                          | NULL                          | NULL    | NULL                      | 1027850 |   100.00 | NULL        |
|  1 | PRIMARY     | <derived2>      | NULL       | ref   | <auto_key0>                   | <auto_key0>                   | 183     | Address.lotNumber         |      10 |   100.00 | NULL        |
|  2 | DERIVED     | Address         | NULL       | index | idx_Address_lotNumber         | idx_Address_lotNumber         | 183     | NULL                      | 1027850 |    90.00 | Using where |
 ---- ------------- ----------------- ------------ ------- ------------------------------- ------------------------------- --------- --------------------------- --------- ---------- -------------

Currently the Address table has about 1M records, but jobId is null for all of them so the left joined subquery returns an empty set.

The subquery takes ~0.07 second to run by itself, however the entire query takes ~2.22 seconds. The query without the subquery takes ~0.07 second. It seems that when joining to an empty set, the entire query should just take ~0.07 ~0.07 = ~0.14 second, so where are the extra 2 seconds coming from? It seems like something inefficient is happening in the join operation. Anyway to improve this?

Thanks

CodePudding user response：

In a short answer: add an index for your jobId column.I have a similar table named proctable, which has 1.4 m rows generated through a procedure. While the table name and column names differ from yours, the structure is alike( with the unnecessary columns omitted):

 CREATE TABLE `proctable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `num` int(11) DEFAULT NULL,
  `jid` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx` (`num`)
) ENGINE=InnoDB AUTO_INCREMENT=1429159 DEFAULT CHARSET=latin1

The id has values range from 1 to 1429158, while the num is randomly generated using function rand() which is from 1 to 10000. The jid has all rows set to null. The original query (without indexing column jid)executes like this:

select * 
from proctable t1
left join (select  num, max(jid) as id
    from proctable
    where jid is not null
    group by num
) t2 on t1.num = t2.num;
-- 1429158 rows in set (7.08 sec)   

explain select * 
from proctable t1
left join (select  num, max(jid) as id
    from proctable
    where jid is not null
    group by num
) t2 on t1.num = t2.num;

-- here is the execution plan :

| id | select_type | table      | partitions | type  | possible_keys | key         | key_len | ref           | rows    | filtered | Extra       |
 ---- ------------- ------------ ------------ ------- --------------- ------------- --------- --------------- --------- ---------- ------------- 
|  1 | PRIMARY     | t1         | NULL       | ALL   | NULL          | NULL        | NULL    | NULL          | 1426920 |   100.00 | NULL        |
|  1 | PRIMARY     | <derived2> | NULL       | ref   | <auto_key0>   | <auto_key0> | 5       | testdb.t1.num |      10 |   100.00 | NULL        |
|  2 | DERIVED     | proctable  | NULL       | index | idx           | idx         | 5       | NULL          | 1426920 |    90.00 | Using where |

Then we index the jid column and do it again.

create index jidx on proctable(jid);

-- the execution plan has changed:
explain select *  from proctable t1 left join (select  num, max(jid) as id     from proctable     where jid is not null     group by num ) t2 on t1.num = t2.num;
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- -------------------------------------------------------- 
| id | select_type | table      | partitions | type  | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                                                  |
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- -------------------------------------------------------- 
|  1 | PRIMARY     | t1         | NULL       | ALL   | NULL          | NULL | NULL    | NULL | 1426920 |   100.00 | NULL                                                   |
|  1 | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL | NULL    | NULL |       2 |   100.00 | Using where; Using join buffer (Block Nested Loop)     |
|  2 | DERIVED     | proctable  | NULL       | range | idx,jidx      | jidx | 5       | NULL |       1 |   100.00 | Using index condition; Using temporary; Using filesort |

By using jidx as the key, the DERIVED one using range as the type instead of index has drastically reduced the rows anticipated from 1 million to 1. Let's execute the query.

select * 
from proctable t1
left join (select  num, max(jid) as id
    from proctable
    where jid is not null
    group by num
) t2 on t1.num = t2.num;
-- 1429158 rows in set (3.73 sec)

Note: This is done on my lab VM which has a measely 4G physical ram. By adding an index for jid , it nearly reduced the response time by half. Additionally, if you change the outter join to inner join or straight join, the discrepancy is huge.

select *  from proctable t1   join (select  num, max(jid) as id     from proctable     where jid is not null     group by num ) t2 on t1.num = t2.num;
-- Empty set (0.00 sec)

explain  select *  from proctable t1   join (select  num, max(jid) as id     from proctable     where jid is not null     group by num ) t2 on t1.num = t2.num;
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- -------- ------ ---------- -------------------------------------------------------- 
| id | select_type | table      | partitions | type  | possible_keys | key  | key_len | ref    | rows | filtered | Extra                                                  |
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- -------- ------ ---------- -------------------------------------------------------- 
|  1 | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL | NULL    | NULL   |    2 |   100.00 | Using where                                            |
|  1 | PRIMARY     | t1         | NULL       | ref   | idx           | idx  | 5       | t2.num |  146 |   100.00 | NULL                                                   |
|  2 | DERIVED     | proctable  | NULL       | range | jidx,idx      | jidx | 5       | NULL   |    1 |   100.00 | Using index condition; Using temporary; Using filesort |

explain  select *  from proctable t1   straight_join (select  num, max(jid) as id     from proctable     where jid is not null     group by num ) t2 on t1.num = t2.num;
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- -------------------------------------------------------- 
| id | select_type | table      | partitions | type  | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                                                  |
 ---- ------------- ------------ ------------ ------- --------------- ------ --------- ------ --------- ---------- -------------------------------------------------------- 
|  1 | PRIMARY     | t1         | NULL       | ALL   | idx           | NULL | NULL    | NULL | 1426920 |   100.00 | NULL                                                   |
|  1 | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL | NULL    | NULL |       2 |    50.00 | Using where; Using join buffer (Block Nested Loop)     |
|  2 | DERIVED     | proctable  | NULL       | range | jidx,idx      | jidx | 5       | NULL |       1 |   100.00 | Using index condition; Using temporary; Using filesort |

select *  from proctable t1   straight_join (select  num, max(jid) as id     from proctable     where jid is not null     group by num ) t2 on t1.num = t2.num;
-- Empty set (2.91 sec)