Our front-end system has the ability to define dynamic SQL queries using AND/OR/NOT from multiple tables, and the core system works fine - but it's slows down to being unusable due to the compounded scanning of IN. For the life of me, I can't figure out how to have this level of dynamic functionality without using IN. Below is the code that works perfectly fine (the filter matching is ultra fast), but the compounding of the IN scan takes > 60 seconds because it's 50,000 records for some of the filter returns.
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('mike meyers' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT phw.person_main_ref_id AS ID
FROM person_history_work phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
),
filter2 AS
(
SELECT pa.person_main_ref_id AS ID
FROM person_address pa
WHERE pa.global_address_details_ref_id IN
(
SELECT gad.ID
FROM global_address_details gad
WHERE gad.address_city LIKE '%seattle%'
)
),
all_indexes AS
(
SELECT ID FROM filter0
UNION (SELECT ID FROM filter1)
UNION (SELECT ID FROM filter2)
),
person_filter AS
(
SELECT ai.ID
FROM all_indexes ai
WHERE
(
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID NOT IN (SELECT ID FROM filter1)
OR ai.ID IN (SELECT ID FROM filter2)
)
)
SELECT (JSON_ARRAYAGG(pf.ID)) FROM person_filter pf;
Filter 0 has 461 records, Filter 1 has 48480 and Filter 2 has 750.
The key issue is with the WHERE statement; because the front-end can say AND/OR and NOT on any "joined" query.
So if I change it to:
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID IN (SELECT ID FROM filter1)
AND ai.ID IN (SELECT ID FROM filter2)
The query takes more than 60 seconds. Because it's scanning 461 * 48480 * 750 = 16,761,960,00. UGH.
Of course I could hardcode around this if it was a static stored procedure or call, but it's a dynamic interpolative system that takes the settings defined by the user, so the user can define the above.
As you can see what I do is create a list of all indexes involved, then select them based on the AND/OR/NOT values as defined by the front-end web tool.
Obviously IN won't work for this; the question is what other techniques could I use that don't involve the use of IN that would allow the same level of flexibility with AND/OR/NOT?
Update for @BillKarwin in Comments
So the below code works well for executing an AND, NOT and OR:
SELECT pm.ID
FROM person_main pm
JOIN filter0 f0 ON f0.ID = pm.ID -- AND
LEFT JOIN filter1 f1 ON pm.ID = f1.ID WHERE f1.ID IS NULL -- NOT
UNION (SELECT ID FROM filter2) -- OR
I believe I can make this work with our system; I just need to store the different types (AND/NOT/OR) and execute them in process; let me do some updates and I'll get back to you.
CodePudding user response:
Avoid IN ( SELECT ... )
-- Use JOIN
or EXISTS
Avoid SELECT ID FROM ( SELECT ID FROM .... )
-- The outer SELECT is unnecessary.
Move UNION
to the outer level (in some situations)
all_indexes
seems to simplify to
( SELECT phw.person_main_ref_id AS ID
FROM person_history_work AS phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
) UNION ALL
( SELECT gad.ID
FROM global_address_details AS gad
WHERE gad.address_city LIKE '%seattle%'
)
Can you change the last part to WHERE address_city = 'seattle'
? If so, then you can use INDEX(address_city)
If not, would a FULLTEXT index together with MATCH work for you?
See if you can follow my lead and simplify the rest.
WITH
was only recently added to the MySQL's syntax. I suspect it will take another release or two before it is well optimized; try to avoid WITH
. Since you are "building" the query, you can "build" UNION
, LEFT JOIN
, etc, as needed.
CodePudding user response:
As discussed in the comments above:
Logically, you can replace a lot of your subqueries with JOIN when they are AND terms of your expression, or UNION when they are OR terms of your expression. Also learn about exclusion joins.
But that doesn't necessarily mean that the queries will run faster, unless you have created indexes to support the join conditions and the user-defined conditions.
But which indexes should you create?
Ultimately, it's not possible to optimize all dynamic queries that users come up with. You may be able to run their queries (as you are already doing), but they won't be efficient.
It's kind of a losing game to allow users to specify arbitrary conditions. It's better to give them a fixed set of choices, which are types of queries that you have taken the time to optimize. Then allow them to run a "user-specified" query, but label it clearly that it is not optimized and it will likely take a long time.