Im not good in sql, help me please understand this situation. We have mysql 5.6 and table with ~30k rows. I need to find record with some conditions and max date. This is a query: (cant paste the original query cause of nda)
SELECT * FROM records r1
WHERE column1='X' AND column2='Y' AND
column3 = (SELECT MAX(column3) FROM records r2
WHERE r2.column1=r1.column1 AND r2.column2=r1.column2 AND r2.column3 <= '<some_date>');
Duration: 14.094 sec
Explain:
select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
---|---|---|---|---|---|---|---|---|
PRIMARY | r1 | all | 30k | using where | ||||
DEPENDENT SUBQUERY | r2 | all | 30k | using where |
I guess it is so called correlative subquery
And there is another one using derived table:
SELECT * FROM records r1
WHERE column1='X' AND column2='Y' AND
column3 = (SELECT MAX(column3) FROM
(SELECT * FROM records
WHERE column1='X' AND column2='Y' AND column3 <= '<some_date>') r2);
Duration: 0.047 sec
Explain:
select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
---|---|---|---|---|---|---|---|---|
PRIMARY | r1 | all | 30k | using where | ||||
SUBQUERY | derived | all | 30k | |||||
DEPENDENT SUBQUERY | r2 | all | 30k | using where |
Both queries return same result, why is there such big difference in performance? Does it mysql 5.6 thing? I dont see any clue in explain, maybe i just dont understand it well.
CodePudding user response:
First, and most important, none of the columns are indexed. Each search of the table will result in at least one table scan. And the subqueries might have to scan the table for each match. That's bad. A composite index (column1, column2)
would probably help immensely.
Second, MySQL 5.6 was end-of-lifed in 2021. Consider upgrading to at least MySQL 5.7 and probably MySQL 8.0 which adds many important features and may improve the query optimizer. In particular, MySQL 5.6 and 5.7 both say...
For certain cases, a correlated subquery is optimized... Otherwise, they are inefficient and likely to be slow. Rewriting the query as a join might improve performance.
Whereas MySQL 8 no longer has that warning instead saying...
Beginning with MySQL 8.0.24, the optimizer can transform a correlated scalar subquery to a derived table when the subquery_to_derived flag of the optimizer_switch variable is enabled.
I'd suggest trying your query on MySQL 8.0.
An optimizer trace would provide more information, but here's what I think is happening.
SELECT *
FROM records r1
WHERE column1='X'
AND column2='Y'
AND column3 = (
SELECT MAX(column3)
FROM (
SELECT *
FROM records
WHERE column1='X'
AND column2='Y'
AND column3 <= '<some_date>'
) r2
);
Note that the subquery where
clause uses all hard coded values. That means it only needs to be run once and the value can be cached. That's a pretty simple optimization that the optimizer could be doing. We can think of it like so:
-- One table scan
SELECT @subquery:=MAX(column3)
FROM (
SELECT *
FROM records
WHERE column1='X'
AND column2='Y'
AND column3 <= '<some_date>'
) r2
);
-- One table scan
SELECT *
FROM records r1
WHERE column1='X'
AND column2='Y'
AND column3 = @subquery;
With 30,000 rows, two table scans is just 60,000 rows.
That's one possibility. An optimizer trace would tell the whole story.
Now let's look at the first query.
SELECT *
FROM records r1
WHERE column1='X'
AND column2='Y'
AND column3 = (
SELECT MAX(column3)
FROM records r2
WHERE r2.column1=r1.column1
AND r2.column2=r1.column2
AND r2.column3 <= '<some_date>'
);
Again, because of the lack of indexes the outer query has to do a full table scan. The subquery must also do a full table scan, but because it is correlated it must do so for each matching row.
Let's assume the optimizer is smart and it checks that column1 and column2 match before checking column3. It must run the subquery for each matching row. Each run is a full table scan. Let's say there are 300 rows with matching column1 and column2. That's 301 table scans or 9 million rows. That would explain the performance difference.
The optimizer could be smarter and only run the subquery once on the subset of rows which match on column1 and column2, but it doesn't seem to be doing that. MySQL 8 might be able to better optimize the query.
If you don't want to repeat the where conditions in the subquery, try a self-join.
SELECT *
FROM records r1
JOIN (
SELECT column1, column2, MAX(column3) max3
FROM records
WHERE column3 <= '<some_date>'
GROUP BY column1, column2
) r2 ON
r1.column1 = r2.column1 AND
r1.column2 = r2.column2 AND
r1.column3 = r2.max3