I have a question about indexing and a strange (I suppose) behaviour I found when trying out some queries I'm using in a project.
I have the following schema:
a) an `activities` table
--------- -------------------------------------------------------- ------ ----- ------------------- -------------------
| Field | Type | Null | Key | Default | Extra |
--------- -------------------------------------------------------- ------ ----- ------------------- -------------------
| id | bigint unsigned | NO | PRI | NULL | auto_increment |
| date | timestamp | NO | UNI | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| address | varchar(255) | NO | | NULL | |
| title | varchar(255) | NO | | NULL | |
| state | enum('todo','done','interrupted','pending','rejected') | NO | | todo | |
--------- -------------------------------------------------------- ------ ----- ------------------- -------------------
b) a users
table
------------------- --------------- ------ ----- ------------------- -------------------
| Field | Type | Null | Key | Default | Extra |
------------------- --------------- ------ ----- ------------------- -------------------
| id | varchar(128) | NO | PRI | NULL | |
| email | varchar(1024) | YES | | NULL | |
| name | varchar(1024) | YES | | NULL | |
| surname | varchar(1024) | YES | | NULL | |
| phone_number | varchar(1024) | NO | | NULL | |
| propic_url | varchar(500) | YES | | NULL |
c) and a worker_to_activity
table to assign workers to a specified activity
------------- ----------------- ------ ----- --------- -------
| Field | Type | Null | Key | Default | Extra |
------------- ----------------- ------ ----- --------- -------
| activity_id | bigint unsigned | NO | PRI | NULL | |
| worker_id | varchar(128) | NO | PRI | NULL | |
------------- ----------------- ------ ----- --------- -------
I have a very simple query that searches for all the activities in a certain time range in which a specified user is present, something like this:
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "start" and a.date <= "end"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
And, last but not least, I have an index for this purpose:
create index act_fetch_idx (date, id, address, title, state) on activities;
For testing purposes (seeing if the index works) I deliberately inserted an activity that happens before other activities, but with an higher ID, simulating somebody inserting an activity later than expected (I'm talking about activity #50 in the following view):
---- ---------------------
| id | date |
---- ---------------------
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 50 | 2022-09-14 10:55:49 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
| 11 | 2022-09-14 18:00:00 |
| 12 | 2022-09-14 19:00:00 |
| 13 | 2022-09-15 08:00:00 |
| 14 | 2022-09-15 09:00:00 |
| 15 | 2022-09-15 10:00:00 |
| 16 | 2022-09-15 11:00:00 |
| 17 | 2022-09-15 12:00:00 |
| 18 | 2022-09-15 13:00:00 |
| 19 | 2022-09-15 14:00:00 |
| 20 | 2022-09-15 15:00:00 |
In this example, the index works! It's a simple select id, date from activities
Something strange happens when I use date ranges:
a) In this case (using a "daily" range) the index works:
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "2022-09-14 00:00" and a.date <= "2022-09-14 23:59"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
Output:
---- ---------------------
| id | date |
---- ---------------------
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 50 | 2022-09-14 10:55:49 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
---- ---------------------
b) When I use a wider range, i.e the activities in the next two days, the index "stops working (?)":
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "2022-09-14 00:00" and a.date <= "2022-09-15 23:59"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
Output:
---- ---------------------
| id | date |
---- ---------------------
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
| 11 | 2022-09-14 18:00:00 |
---- ---------------------
The EXPLAIN
of both queries is slightly different, but I don't understand what causes this behaviour:
a) EXPLAIN
a:
---- ------------- -------------------- ------------ -------- ---------------------------- --------------- --------- ----------------------- ------ ---------- --------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- -------------------- ------------ -------- ---------------------------- --------------- --------- ----------------------- ------ ---------- --------------------------
| 1 | SIMPLE | a | NULL | range | PRIMARY,date,act_fetch_idx | act_fetch_idx | 4 | NULL | 13 | 100.00 | Using where; Using index |
| 1 | SIMPLE | worker_to_activity | NULL | eq_ref | PRIMARY,worker_id | PRIMARY | 522 | db.a.id,const | 1 | 100.00 | Using index |
---- ------------- -------------------- ------------ -------- ---------------------------- --------------- --------- ----------------------- ------ ---------- --------------------------
b) EXPLAIN
b:
---- ------------- -------------------- ------------ -------- --------------------------------------------- ----------- --------- ------------------------------------------- ------ ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- -------------------- ------------ -------- --------------------------------------------- ----------- --------- ------------------------------------------- ------ ---------- -------------
| 1 | SIMPLE | worker_to_activity | NULL | ref | PRIMARY,worker_id | worker_id | 514 | const | 23 | 100.00 | Using index |
| 1 | SIMPLE | a | NULL | eq_ref | PRIMARY,date,act_fetch_idx,count_activities | PRIMARY | 8 | db.worker_to_activity.activity_id | 1 | 53.49 | Using where |
---- ------------- -------------------- ------------ -------- --------------------------------------------- ----------- --------- ------------------------------------------- ------ ---------- -------------
The naive solution is using ORDER BY
(which produces the same result as the first scenario both in execution plan and results), but my question was more about why this happens: shouldn't the index sort the results in both cases? I searched on the web but I couldn't find much.
Thanks in advance for your time.
CodePudding user response:
Those EXPLAINs are quite different.
The first query noticed that the date range was "small" so it decided to start with activities
, fetch a few rows, check each against the other table.
The second query decided that the other table might be a better starting point.
(Caveat: Since the Optimizer has no cross-table statistics, there is a non-trivial chance that it will pick the 'wrong' order to look at the two tables.)
It is usually "wrong" to use LIMIT
without ORDER BY
. Since the query plans produced the rows in different ways, you should expect the 11 rows to be different and sorted differently.
You could experiment further by using EXPLAIN FORMAT=JSON SELECT ...
-- this gives some more clues into the Optimizer's thinking. For even more clues, see "Optimizer trace". And I like checking the Handler
counts: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts