This is a follow-up optimization on Adding a SUM group by into a COUNT DISTINCT query (albeit with some optimizations and joins simplifications).
I wonder if it's possible to optimize the following PostgreSQL 13.1 query which took 130322.2ms
to complete. Normally if only one JOIN LATERAL
is present it does it in a few ms.
What I'm most lost is given that each JOIN LATERAL
has an ON
with a condition based on the score of its own subquery, how could I optimize the query potentially reducing the number of JOIN LATERAL
and still get the same results.
From what I see, it seems it gets slow when conditional OR
are added to some of the WHERE inside the JOIN LATERAL
instead of AND
. See:
SELECT count(*)
FROM subscriptions q
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
AND ts.tag_id = 21
) AS q62958 ON q62958.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 32
) AS q120342 ON q120342.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 35
) AS q992506 ON q992506.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 33
) AS q343255 ON q343255.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 29
) AS q532052 ON q532052.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 30
) AS q268437 ON q268437.sum_score <= 1
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
AND ts.tag_id = 46
) AS q553964 ON q553964.sum_score >= 3
JOIN LATERAL (
SELECT
SUM(ts.score) AS sum_score
FROM
quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE
qa.quiz_id = q.quiz_id
AND ts.tag_id = 24
) AS q928243 ON q928243.sum_score >= 2
WHERE
q.state = 'subscribed' AND q.app_id = 4
;
The subscriptions
table has less than 15000 rows and less than 2000 matching the WHERE
clause. Both q.state
and q.app_id
have indexes.
The complete EXPLAIN ANALYZE
: https://explain.depesz.com/s/Ok0h
CodePudding user response:
The main issue is that the query is wrong:
WHERE
qa.quiz_id = q.quiz_id
OR ts.tag_id = 32
The OR
is misplaced here and must be introduced after aggregating the right rows. This WHERE
clause includes all rows with either a matching quiz_id
or tag_id = 32
. So all rows, which is nonsense.
In addition to that, you can conflate multiple LATERAL
subqueries with conditional aggregates like this:
SELECT count(*)
FROM subscriptions q
JOIN LATERAL (
SELECT sum(ts.score) FILTER (WHERE ts.tag_id = 21) AS sum_score21
, sum(ts.score) FILTER (WHERE ts.tag_id = 32) AS sum_score32
, sum(ts.score) FILTER (WHERE ts.tag_id = 35) AS sum_score35
-- , more?
FROM quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE qa.quiz_id = q.quiz_id
AND ts.tag_id IN (21, 32, 35) -- more?
) AS t ON t.sum_score21 <= 1
OR t.sum_score32 <= 1
OR t.sum_score35 <= 1
-- AND / OR more?
WHERE q.state = 'subscribed'
AND q.app_id = 4;
Be aware of operator precedence when adding more conditions with AND
or OR
: You may need parentheses as AND
binds before OR
.
About aggregated FILTER
:
A multicolumn index on subscriptions(app_id, state, quizz_id)
might help (give you index-only scans). But since the table isn't that big, that's not important.
LATERAL
(rather than a plain subquery) still makes sense while the outer filter eliminates most rows from table subscriptions
. A multicolumn index on tag_scores(answer_id, tag_id)
might help.
With more tags in the subquery and/or more subscriptions, the LATERAL
variant and the usefulness of said index fall off.
For comparison, here is a variant with plain subquery:
SELECT count(*)
FROM subscriptions q
JOIN (
SELECT qa.quiz_id
, sum(ts.score) FILTER (WHERE ts.tag_id = 21) AS sum_score21
, sum(ts.score) FILTER (WHERE ts.tag_id = 32) AS sum_score32
, sum(ts.score) FILTER (WHERE ts.tag_id = 35) AS sum_score35
-- , more?
FROM quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
WHERE ts.tag_id IN (21, 32, 35) -- more?
GROUP BY qa.quiz_id
) AS t USING (quiz_id)
WHERE q.state = 'subscribed'
AND q.app_id = 4
AND (t.sum_score21 <= 1
OR t.sum_score32 <= 1
OR t.sum_score35 <= 1)
-- AND / OR more?
;
Either way, t.sum_score21 <= 1
qualifies if ...
- at least one row with tag 21 is associated
- and the summed score of all with tag 21 for the same quiz is <= 1
Seems like a pretty narrow filter.
Denoise:
If the row in answers
is never missing (referential integrity enforced with FK constraints?), you can cut out the middleman here:
FROM quiz_answers qa
JOIN answers a ON a.id = qa.answer_id
JOIN tag_scores ts ON ts.answer_id = a.id
-->
FROM quiz_answers qa
JOIN tag_scores ts USING (answer_id)