I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optize it, reading how to improve things by looking at the EXPLAIN output.
One of the tables person_deliveries
is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.
Here is the query:
SELECT
DATE(pdel.date) AS date,
pdel.ip_address AS ip_address,
pdel.sending_campaigns_id AS campaigns_id,
(substring_index(pe.email, '@', -1)) AS recipient_domain,
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints
FROM
person_deliveries pdel
LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
WHERE
(pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
AND pe.id IS NOT NULL
AND pdel.ip_address is NOT NULL
GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id
ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;
Here is the output of EXPLAIN
:
---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- ---------------------------------------------------------------------
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- ---------------------------------------------------------------------
| 1 | SIMPLE | pdel | NULL | range | person_campaign_date,ip_address,date,emails_id | date | 5 | NULL | 2333678 | 50.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | pe | NULL | eq_ref | PRIMARY | PRIMARY | 4 | subscriber.pdel.emails_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | pc | NULL | ref | ip_address | ip_address | 18 | subscriber.pdel.ip_address | 128 | 100.00 | NULL |
---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- ---------------------------------------------------------------------
I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.
I know I am missing something here, either an index or using a function that is causing it to be slow, but from everything I have read I haven't figured it out yet.
CodePudding user response:
Too many questions vs comments. It appears for date criteria you are only pulling for a SINGLE date. Is this always the case?, or just this sample. Your pdel.date. Is it a date or date/time as stored. Your query is doing >= '2022-03-11' AND <= '2022-03-12'. Is this because your are trying to get up to and including 2022-03-11 at 11:59:59pm? And if so, should it be LESS than 03-12?
If your counts are based on a single day basis, and this data is rather fixed... that is you are not going to be changing deliveries, etc. on a day that has already passed. This might be a candidate condition for having a stored aggregate table that is done on a daily basis. This way when you are looking for activity patterns, you can have the non-changing aggregates already done and just go against that. Then if you need the details, go back to the raw data.
CodePudding user response:
These indexes are "covering", which should help some:
pdel: INDEX(date, ip_address, sending_campaigns_id, emails_id)
pc: INDEX(ip_address, campaigns_id, emails_id)
Assuming date
is a DATETIME
, this contains an extra midnight:
WHERE pdel.date >= '2022-03-11'
AND pdel.date <= '2022-03-12'
I prefer the pattern:
WHERE pdel.date >= '2022-03-11'
AND pdel.date < '2022-03-11' INTERVAL 1 DAY
When the GROUP BY
and ORDER BY
are different, an extra sort is (usually) required. So, write the GROUP BY
to be just like the ORDER BY
(after removing "ASC").
A minor simplification (and speedup):
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
-->
COUNT(DISTINCT, pdel.emails_id, pdel.date) AS deliveries,
Consider storing the numeric version of the IPv4 in INT UNSIGNED
(only 4 bytes) instead of a VARCHAR
. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA
in the SELECT
.
The COUNT(CASE ... )
can be simplified to
SUM( pdel.ip_address = pc.ip_address
AND pdel.sending_campaigns_id = pc.campaigns_id
AND pdel.emails_id = pc.emails_id ) AS complaints
In
(substring_index(pe.email, '@', -1)) AS recipient_domain,
I think it should be 1
, not -1
or the alias is 'wrong'.
Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL
to equivalent, but simpler JOIN pe
without the null test.
Sorry, but those will not provide a huge performance improvement. The next step would be to build and maintain a Summary Tables and use that to generate the desired 'report'. (See DRapp's Answer.)