Performance difference between correlated subquery and derived table in mysql 5.6-CodePudding

Im not good in sql, help me please understand this situation. We have mysql 5.6 and table with ~30k rows. I need to find record with some conditions and max date. This is a query: (cant paste the original query cause of nda)

    SELECT * FROM records r1 
    WHERE column1='X' AND column2='Y' AND
    column3 = (SELECT MAX(column3) FROM records r2 
               WHERE r2.column1=r1.column1 AND r2.column2=r1.column2 AND r2.column3 <= '<some_date>');

Duration: 14.094 sec
Explain:

select_type	table	type	possible_keys	key	key_len	ref	rows	extra
PRIMARY	r1	all					30k	using where
DEPENDENT SUBQUERY	r2	all					30k	using where

I guess it is so called correlative subquery

And there is another one using derived table:

    SELECT * FROM records r1 
    WHERE column1='X' AND column2='Y' AND
    column3 = (SELECT MAX(column3) FROM 
                  (SELECT * FROM records 
                   WHERE column1='X' AND column2='Y' AND column3 <= '<some_date>') r2);

Duration: 0.047 sec
Explain:

select_type	table	type	rows	extra
PRIMARY	r1	all	30k	using where
SUBQUERY	derived	all	30k
DEPENDENT SUBQUERY	r2	all	30k	using where

Both queries return same result, why is there such big difference in performance? Does it mysql 5.6 thing? I dont see any clue in explain, maybe i just dont understand it well.

CodePudding user response：

First, and most important, none of the columns are indexed. Each search of the table will result in at least one table scan. And the subqueries might have to scan the table for each match. That's bad. A composite index (column1, column2) would probably help immensely.

Second, MySQL 5.6 was end-of-lifed in 2021. Consider upgrading to at least MySQL 5.7 and probably MySQL 8.0 which adds many important features and may improve the query optimizer. In particular, MySQL 5.6 and 5.7 both say...

For certain cases, a correlated subquery is optimized... Otherwise, they are inefficient and likely to be slow. Rewriting the query as a join might improve performance.

Whereas MySQL 8 no longer has that warning instead saying...

Beginning with MySQL 8.0.24, the optimizer can transform a correlated scalar subquery to a derived table when the subquery_to_derived flag of the optimizer_switch variable is enabled.

I'd suggest trying your query on MySQL 8.0.

An optimizer trace would provide more information, but here's what I think is happening.

SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = (
    SELECT MAX(column3)
    FROM (
      SELECT *
      FROM records 
      WHERE column1='X'
        AND column2='Y'
        AND column3 <= '<some_date>'
    ) r2
  );

Note that the subquery where clause uses all hard coded values. That means it only needs to be run once and the value can be cached. That's a pretty simple optimization that the optimizer could be doing. We can think of it like so:

-- One table scan
SELECT @subquery:=MAX(column3)
FROM (
  SELECT *
  FROM records 
  WHERE column1='X'
    AND column2='Y'
    AND column3 <= '<some_date>'
  ) r2
);

-- One table scan
SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = @subquery;

With 30,000 rows, two table scans is just 60,000 rows.

That's one possibility. An optimizer trace would tell the whole story.

Now let's look at the first query.

SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = (
    SELECT MAX(column3)
    FROM records r2 
    WHERE r2.column1=r1.column1
      AND r2.column2=r1.column2
      AND r2.column3 <= '<some_date>'
    );

Again, because of the lack of indexes the outer query has to do a full table scan. The subquery must also do a full table scan, but because it is correlated it must do so for each matching row.

Let's assume the optimizer is smart and it checks that column1 and column2 match before checking column3. It must run the subquery for each matching row. Each run is a full table scan. Let's say there are 300 rows with matching column1 and column2. That's 301 table scans or 9 million rows. That would explain the performance difference.

The optimizer could be smarter and only run the subquery once on the subset of rows which match on column1 and column2, but it doesn't seem to be doing that. MySQL 8 might be able to better optimize the query.

If you don't want to repeat the where conditions in the subquery, try a self-join.

SELECT *
FROM records r1 
JOIN (
  SELECT column1, column2, MAX(column3) max3
  FROM records
  WHERE column3 <= '<some_date>'
  GROUP BY column1, column2
) r2 ON
  r1.column1 = r2.column1 AND
  r1.column2 = r2.column2 AND
  r1.column3 = r2.max3