SQL: Combining WHERE, ORDER and LIMIT on 15 million row query-CodePudding

I have 2 tables, item and config.

item has ~15 million rows, config has ~1000 rows.

I want to join the two tables with a WHERE clause and order the results.

This might look something like this:

SELECT
    `t0`.`id`,
    `t0`.`item_name`,
    `t1`.`id`,
    `t1`.`config_name`,
FROM
    `item` t0
    LEFT OUTER JOIN `config` `t1` ON `t0`.`config_id` = `t1`.`id`
WHERE (`t0`.`config_id` = 678)
ORDER BY
    `t0`.`item_name` ASC;

This runs successfully in ~800ms and returns ~50k rows.

I also want to paginate this result, so I run the same query and add a LIMIT:

SELECT
    `t0`.`id`,
    `t0`.`item_name`,
    `t1`.`id`,
    `t1`.`config_name`,
FROM
    `item` t0
    LEFT OUTER JOIN `config` `t1` ON `t0`.`config_id` = `t1`.`id`
WHERE (`t0`.`config_id` = 678)
ORDER BY
    `t0`.`item_name` ASC LIMIT 200;

This query now takes over 5 minutes.

I am trying to understand what is causing this discrepancy.

I can simplify the query, removing the JOIN altogether, querying only the large table to try and isolate the cause of slow down:

SELECT
    `t0`.`id`,
    `t0`.`item_name`,
FROM
    `item` t0
WHERE (`t0`.`config_id` = 678)
ORDER BY
    `t0`.`item_name` ASC;

This query runs fine, but again, adding the LIMIT drastically increases the query time.

How can I solve this issue or better diagnose what is causing it?

The execution plan for the simplified query without LIMIT is as follows:

 ---- ------------- ------- ------------ ------ --------------- ----------- --------- ------- ------- ---------- --------------------------------------- 
| id | select_type | table | partitions | type | possible_keys |    key    | key_len |  ref  | rows  | filtered |                 extra                 |
 ---- ------------- ------- ------------ ------ --------------- ----------- --------- ------- ------- ---------- --------------------------------------- 
|  1 | SIMPLE      | t0    | NULL       | ref  | ITEM_FK_1     | ITEM_FK_1 |       8 | const | 98524 |   100.00 | Using index condition; Using filesort |
 ---- ------------- ------- ------------ ------ --------------- ----------- --------- ------- ------- ---------- ---------------------------------------

Adding LIMIT 200 to the query produces this execution plan:

 ---- ------------- ------- ------------ ------- --------------- -------------------- --------- ------ ------- ---------- -------------------------- 
| id | select_type | table | partitions | type  | possible_keys |        key         | key_len | ref  | rows  | filtered |          extra           |
 ---- ------------- ------- ------------ ------- --------------- -------------------- --------- ------ ------- ---------- -------------------------- 
|  1 | SIMPLE      | t0    | NULL       | index | ITEM_FK_1     | ITEM_RULE_ITEM_UNQ |     775 | NULL | 31933 |     0.63 | Using where; Using index |
 ---- ------------- ------- ------------ ------- --------------- -------------------- --------- ------ ------- ---------- --------------------------

CodePudding user response：

To find rows with config_id=678 and order them by item_name and take only the first 200, you have (among others) the following options:

use an index that is ordered by item_name, and keep on reading until you found 200 rows that also fulfill config_id=678 (no ordering required)
get all rows with config_id=678 using an index on config_id (your foreign key), then order those rows by name, and take the first 200

Which of those is faster depends on your data.

For the first, it will depend on where the rows with config_id=678 are. If e.g. the first 200 rows (ordered by name, e.g. starting with an A) all have this id, this would be very fast: you can read 200 rows, and stop, and don't even have to order anything. If you are unlucky and all those ids are at the end of this list (e.g. only names that start with a Z have this id), you have to read all rows before you find 200 that fit.

The second option depends on the number of rows you have with config_id=678. It will read all 50k of them (using the index), order them, and give you the first 200. This will be somewhere between the fast and the slow option above.

MySQL basically has to guess now which version is faster. For the query with limit 200, it guessed wrong, as apparently, it had to read more rows than expected.

To give you an idea about what MySQL is thinking:

MySQL assumes that you have 98.524 rows (not 50k) with config_id=678 (the number in rows in your first execution plan).
You have 15 Million rows, so the probability that a specific row has that id is 98.524 / 15 Mill = 1/150. You need 200 of them, so you need to read about 200*150=30.000 (or 31.933, the number in your second execution plan) rows until you probably found enough.

Now MySQL compares reading 100k rows plus ordering them to probably reading 30k rows, and chose the latter. And was wrong for this case (although 5 minutes seem a bit much, but there are also other factors like the increased index size or maybe a missing coverage that might slow it down). But could be right for a different id.

If you increase the limit (which you will have to do for later pages), MySQL will switch the execution plan at some point (finding e.g. the first 1.000 with that probability requires about 1.000*150=150k > 100k rows).

So, what can you do:

you can force MySQL to use the index you want, e.g. with ... from item t0 force index (ITEM_FK_1) left outer join .... This has the disadvantage that, depending on the id, a different execution plan might be faster.
you can add an optimal index: the composite index (config_id, item_name) allows you to read only rows with the correct id, and since they are ordered by name, you can stop after the first 200 rows. No matter your data distribution, you always read 200 rows (or less). And assuming id is the primary key, there is no faster solution than this.

I'd go with option 2.

CodePudding user response：

Add this

INDEX(config_id, item_name,  id)   -- in this order!

And DROP any index that is a 'prefix' of that one.