Is using SELECT DISTINCT without UNION more efficient? Is there a way to optimize the following quer-CodePudding

I have 3 tables. Each of them contain (among others) the columns UserID and Period.

In order to get all the Periods associated with a specific user, I'm using unions like this:

# Original Query
SELECT Period FROM table_1 WHERE UserID = :user
UNION SELECT Period FROM table_2 WHERE UserID = :user
UNION SELECT Period FROM table_3 WHERE UserID = :user
ORDER BY Period ASC;

The results I get are what I expect. Since I'm not using UNION ALL, all duplicate results get automatically filtered out.

The debugger this project is using thinks the query MIGHT be slow. This is the query plan for it-

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	PRIMARY	table_1	ref	UserID	UserID	4	const	29	NULL
2	UNION	table_2	ref	UserID	UserID	4	const	5	NULL
3	UNION	table_3	ref	UserID	UserID	4	const	4	NULL
NULL	UNION RESULT	<union1,2,3>	ALL	NULL	NULL	NULL	NULL	NULL	Using temporary

I thought maybe using SELECT DISTINCT instead of SELECT would help since each SELECT result set would get smaller before the UNION.

# Updated Query
SELECT DISTINCT Period FROM table_1 WHERE UserID = :user
UNION SELECT DISTINCT Period FROM table_2 WHERE UserID = :user
UNION SELECT DISTINCT Period FROM table_3 WHERE UserID = :user
ORDER BY Period ASC;

However this only seems to make the query plan use more temp tables.

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	PRIMARY	table_1	ref	UserID	UserID	4	const	29	Using where; Using temporary
2	UNION	table_2	ref	UserID	UserID	4	const	5	Using where; Using temporary
3	UNION	table_3	ref	UserID	UserID	4	const	4	Using where; Using temporary
NULL	UNION RESULT	<union1,2,3>	ALL	NULL	NULL	NULL	NULL	NULL	Using temporary

I've also tried running this same query repeating lines 2-3 (the UNIONs) across 500 lines to see if there were noticeable differences. Time-wise, the results were pretty similar.

Analyzing the queries, using DISTINCT seemed to make the queries take less time returning the results. However, the query now had to clean up all temp tables, which ended up making the final time for both queries very similar.

My question(s) to all MySQL gurus, is there a way to make this query faster or more optimized without changing the table structures?

Maybe with some indexes? (On that note, Period is a VARCHAR and the tables are all MyISAM)

Do the order of the tables matter in the query? I purposedly tried to place the biggest table first.

CodePudding user response：

# Original Query
SELECT Period FROM table_1 WHERE UserID = :user
UNION SELECT Period FROM table_2 WHERE UserID = :user
UNION SELECT Period FROM table_3 WHERE UserID = :user
ORDER BY Period ASC;

It is processed as:

Execute subquery 1 (output A)
Execute subquery 2 (output B)
Combine outputs A and B (output С)
Sort combined output C and remove duplicates (output D)
Execute subquery 3 (output E)
Combine outputs C and E (output F)
Sort combined output E and remove duplicates (output G)
Return output G

# Updated Query
SELECT DISTINCT Period FROM table_1 WHERE UserID = :user
UNION SELECT DISTINCT Period FROM table_2 WHERE UserID = :user
UNION SELECT DISTINCT Period FROM table_3 WHERE UserID = :user
ORDER BY Period ASC;

It is processed as:

Execute subquery 1 (output A)
Sort output A and remove duplicates (output B)
Execute subquery 2 (output C)
Sort output C and remove duplicates (output D)
Combine outputs B and D (output E)
Sort output E and remove duplicates (output F)
Execute subquery 3 (output G)
Sort output G and remove duplicates (output H)
Combine outputs F and H (output I)
Sort output I and remove duplicates (output J)
Return output J

# Recommended Query
SELECT DISTINCT Period
FROM ( SELECT Period FROM table_1 WHERE UserID = :user
       UNION ALL
       SELECT Period FROM table_2 WHERE UserID = :user
       UNION ALL
       SELECT Period FROM table_3 WHERE UserID = :user ) AS total
ORDER BY Period ASC;

Execute subquery 1 (output A)
Execute subquery 2 (output B)
Combine outputs A and B (output С)
Execute subquery 3 (output E)
Combine outputs (output F)
Sort combined output and remove duplicates (output G)
Return output G

CodePudding user response：

If you have only a few dozen rows in the result, it will be hard to notice whether it is "slow".

Most of the tradeoffs discussed here lead to a conclusion of "it depends".

Suppose there were 1000 rows for the user in each table, but only 10 rows in the result. One formulation would be faster than the other.

Suppose there were almost no dups between the tables. Now the other formulation might be faster.

Since you desire do have the de-dupping, there will be a temp table somewhere.

I suggest you write the query in whatever way is "simplest" or "most logical" to you.

Also, if you want performance, switch from MyISAM to InnoDB. In almost all benchmarks, InnoDB is at least as fast.