Is it a misconception that using 'EXISTS' is faster than 'IN'?-CodePudding

h2. Experiment

First we create a table:

create database test;
use test;
create table user_purchase (order_id int primary key  auto_increment,user_id int,amount int);
create table users (user_id int primary key  auto_increment,name varchar(15),age smallint(4));
ALTER TABLE user_purchase ADD FOREIGN KEY (user_id) REFERENCES users(user_id);

Second insert some random data:

wget https://github.com/Percona-Lab/mysql_random_data_load/releases/download/v0.1.12/mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz && tar -xvf mysql_random_data_load_0.1.12_Linux_x86_64.tar.gz  && chmod 744 mysql_random_data_load

./mysql_random_data_load test user_purchase 4000 --host 127.0.0.1  --password 123 --user root
./mysql_random_data_load test users 10000 --host 127.0.0.1  --password 123 --user root

Third We login in and execute two quries:

Select * FROM users as u where exists (select 1 from user_purchase as up where up.user_id = u.user_id); ===> it takes about 0.05 sec

Select * from users where user_id IN (select user_id from user_purchase group by user_id);  ===> it takes about 0.02 sec

h2. Question

When using "IN" operator it stably takes 0.02sec, however, when using "EXISTS", it stably takes 0.04sec or even longer, Why using IN is faster when it has to do much more row scanning?

h2. Expalins

mysql> EXPLAIN Select * from users where user_id IN (select user_id from user_purchase group by user_id);
 ---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- ------------- 
| id | select_type  | table         | partitions | type   | possible_keys | key        | key_len | ref                | rows  | filtered | Extra       |
 ---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- ------------- 
|  1 | SIMPLE       | users         | NULL       | ALL    | PRIMARY       | NULL       | NULL    | NULL               | 11000 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2>   | NULL       | eq_ref | <auto_key>    | <auto_key> | 5       | test.users.user_id |     1 |   100.00 | NULL        |
|  2 | MATERIALIZED | user_purchase | NULL       | index  | user_id       | user_id    | 5       | NULL               |  5000 |   100.00 | Using index |
 ---- -------------- --------------- ------------ -------- --------------- ------------ --------- -------------------- ------- ---------- ------------- 
3 rows in set, 1 warning (0.00 sec)

mysql> EXPlain Select * FROM users as u where exists (select 1 from user_purchase as up where up.user_id = u.user_id);
 ---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- ------------- 
| id | select_type        | table | partitions | type | possible_keys | key     | key_len | ref            | rows  | filtered | Extra       |
 ---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- ------------- 
|  1 | PRIMARY            | u     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL           | 11000 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | up    | NULL       | ref  | user_id       | user_id | 5       | test.u.user_id |     1 |   100.00 | Using index |
 ---- -------------------- ------- ------------ ------ --------------- --------- --------- ---------------- ------- ---------- ------------- 
2 rows in set, 2 warnings (0.00 sec)

CodePudding user response：

For the "IN" case it executes the subquery in first place and keeps it's result in memory. That's what "Memorization" stays for in the Explain

From the docs

Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory

While "EXISTS" part performs the sub-query for each set of unique values of parent query.

From the docs

For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context

Technically, there are different sub-queries in IN and in EXISTS clauses

In the "IN" you have

select user_id from user_purchase group by user_id

So, it's enough to perform it once and keep the result in memory

but in the EXISTS

select 1 from user_purchase as up where up.user_id = u.user_id

The "WHERE" clause says "execute this query for every set of different values of "user_id" from parent query".

That's why it takes longer for EXISTS to perform