Home > Software design >  FULL table scan in LEFT JOIN using OR
FULL table scan in LEFT JOIN using OR

Time:12-30

How can this query be optimized to avoid the full table scan described below?

I've got a slow query that's taking approximately 15 seconds to return.

Let's get this part out of the way - I've confirmed all indexes are there.

When I run EXPLAIN, it shows that there is a FULL TABLE scan ran on the crosswalk table (the index for fromQuestionCategoryJoinID is not used, even if I attempt to force) - if I remove either of the fields and the OR, the index is used and query completes in milliseconds.

SELECT c.id, c.name, GROUP_CONCAT(DISTINCT tags.externalDisplayID SEPARATOR ', ') AS tags
FROM checklist c
LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
LEFT JOIN questioncategoryjoin qcatjsub on qcatjsub.parentQuestionID = qcatj.questionID
LEFT JOIN crosswalk cw on (cw.fromQuestionCategoryJoinID = qcatj.id OR cw.fromQuestionCategoryJoinID = qcatjsub.id)
-- index used if I remove OR, eg.: LEFT JOIN crosswalk cw on (cw.fromQuestionCategoryJoinID = qcatj.id)
LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
LEFT JOIN question tags on tags.id = qcj1.questionID
GROUP BY c.id
ORDER BY c.name, tags.externalDisplayID;

CodePudding user response:

Split the query into two queries for each part of the OR. Then combine them with UNION.

SELECT id, name, GROUP_CONCAT(DISTINCT externalDisplayID SEPARATOR ', ') AS tags
FROM (
    SELECT c.id, c.name, tags.externalDisplayID
    FROM checklist c
    LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
    LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
    LEFT JOIN crosswalk cw on cw.fromQuestionCategoryJoinID = qcatj.id
    LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
    LEFT JOIN question tags on tags.id = qcj1.questionID

    UNION ALL

    SELECT c.id, c.name, tags.externalDisplayID
    FROM checklist c
    LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
    LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = qcheckj.questionID
    LEFT JOIN questioncategoryjoin qcatjsub on qcatjsub.parentQuestionID = qcatj.questionID
    LEFT JOIN crosswalk cw on cw.fromQuestionCategoryJoinID = qcatjsub.id
    LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = cw.toQuestionCategoryJoinID
    LEFT JOIN question tags on tags.id = qcj1.questionID
) AS x
GROUP BY x.id
ORDER BY x.name

Also, it doesn't make sense to include externalDisplayID in ORDER BY, because that will order by its value from a random row in the group. You could put ORDER BY externalDisplayID in the GROUP_CONCAT() arguments if that's what you want.

CodePudding user response:

There is a second inefficiency going on here. I call it "explode-implode". First a bunch of JOINs (potentially) expand the number of rows in an intermediate table, then GROUP BY c.id collapses the number of rows back to what you started with (one row of output per row of checkpoint).

Before trying to help with that, please answer:

  • Is LEFT really needed?
  • How many rows in each table? (Especially in cw)
  • Can you get rid of DISTINCT?

Barmar's answer can possibly be improved upon by delaying the JOINs to qcj1andtagsuntil after theUNION`:

SELECT ...
    FROM ( SELECT ...
               FROM first few tables
           UNION ALL
           SELECT ...
               FROM first few tables
         ) AS u
    [LEFT] JOIN qcj1
    [LEFT] JOIN tags
    GROUP BY ...
    ORDER BY ...

Another optimization (again building on Barmar's)

GROUP BY x.id
ORDER BY x.name

-->

GROUP BY x.name, x.id
ORDER BY x.name, x.id

When the items in GROUP BY and ORDER BY are the "same", they can be done in a single action, thereby saving (at least) a sort.

x.name, x.id is deterministic, where as x.name might put two rows with the same name in a different order, depending (perhaps) on the phase of the moon.

These indexes may help:

qcheckj:  INDEX(checklistID,  questionID)
qcatj:  INDEX(questionID,  id)
qcatjsub:  INDEX(parentQuestionID,  id)
cw:  INDEX(fromQuestionCategoryJoinID,  toQuestionCategoryJoinID)

CodePudding user response:

Try it this way:

SELECT c.id, c.name, GROUP_CONCAT(DISTINCT tags.externalDisplayID 
SEPARATOR ', ') AS tags
FROM checklist c
LEFT JOIN questionchecklistjoin qcheckj on qcheckj.checklistID = c.id
LEFT JOIN questioncategoryjoin qcatj ON qcatj.questionID = 
qcheckj.questionID
LEFT JOIN questioncategoryjoin qcatjsub on qcatjsub.parentQuestionID = 
qcatj.questionID
LEFT JOIN crosswalk cw on 
(cw.fromQuestionCategoryJoinID,cw.fromQuestionCategoryJoinID) = 
(qcatj.id,qcatjsub.id)
LEFT JOIN crosswalk cw on (cw.fromQuestionCategoryJoinID = qcatj.id)
LEFT JOIN questioncategoryjoin qcj1 on qcj1.id = 
cw.toQuestionCategoryJoinID
LEFT JOIN question tags on tags.id = qcj1.questionID
GROUP BY c.id
ORDER BY c.name, tags.externalDisplayID;

This removes the OR and achieve the same .

  • Related