h2. Experiment
First we create a table:
create database test;
use test;
create table user_purchase (order_id int primary key auto_increment,user_id int,amount int);
create table users (user_id int primary key auto_increment,name varchar(15),age smallint(4));
ALTER TABLE user_purchase ADD FOREIGN KEY (user_id) REFERENCES users(user_id);
Second insert some random data:
wget https://github.com/Percona-Lab/mysql_random_data_load/releases/download/v0.1.12/mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz && tar -xvf mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz && chmod 744 mysql_random_data_load
./mysql_random_data_load test user_purchase 4000 --host 127.0.0.1 --password 123 --user root
./mysql_random_data_load test users 10000 --host 127.0.0.1 --password 123 --user root
Third We login in and execute two quries:
Select * FROM users as u where exists (select 1 from user_purchase as up where up.user_id = u.user_id); ===> it takes about 0.05 sec
Select * from users where user_id IN (select user_id from user_purchase group by user_id); ===> it takes about 0.02 sec
h2. Question
When using "IN" operator it stably takes 0.02sec, however, when using "EXISTS", it stably takes 0.04sec or even longer, Why using IN is faster when it has to do much more row scanning?
h2. Expalins
mysql> EXPLAIN Select * from users where user_id IN (select user_id from user_purchase group by user_id);
---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- -------------
| 1 | SIMPLE | users | NULL | ALL | PRIMARY | NULL | NULL | NULL | 11000 | 100.00 | Using where |
| 1 | SIMPLE | <subquery2> | NULL | eq_ref | <auto_key> | <auto_key> | 5 | test.users.user_id | 1 | 100.00 | NULL |
| 2 | MATERIALIZED | user_purchase | NULL | index | user_id | user_id | 5 | NULL | 5000 | 100.00 | Using index |
---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- -------------
3 rows in set, 1 warning (0.00 sec)
mysql> EXPlain Select * FROM users as u where exists (select 1 from user_purchase as up where up.user_id = u.user_id);
---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- -------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- -------------
| 1 | PRIMARY | u | NULL | ALL | NULL | NULL | NULL | NULL | 11000 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | up | NULL | ref | user_id | user_id | 5 | test.u.user_id | 1 | 100.00 | Using index |
---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- -------------
2 rows in set, 2 warnings (0.00 sec)
CodePudding user response:
For the "IN" case it executes the subquery in first place and keeps it's result in memory. That's what "Memorization" stays for in the Explain
From the docs
Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory
While "EXISTS" part performs the sub-query for each set of unique values of parent query.
From the docs
For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context
Technically, there are different sub-queries in IN and in EXISTS clauses
In the "IN" you have
select user_id from user_purchase group by user_id
So, it's enough to perform it once and keep the result in memory
but in the EXISTS
select 1 from user_purchase as up where up.user_id = u.user_id
The "WHERE" clause says "execute this query for every set of different values of "user_id" from parent query".
That's why it takes longer for EXISTS to perform