Optimize Mysql query with two joins with explain-CodePudding

I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optize it, reading how to improve things by looking at the EXPLAIN output.

One of the tables person_deliveries is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.

Here is the query:

SELECT 
    DATE(pdel.date) AS date, 
    pdel.ip_address AS ip_address, 
    pdel.sending_campaigns_id AS campaigns_id, 
    (substring_index(pe.email, '@', -1)) AS recipient_domain, 
    COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries, 
    COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints 
    FROM 
        person_deliveries pdel 
        LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
        LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
    WHERE 
        (pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
        AND pe.id IS NOT NULL 
        AND pdel.ip_address is NOT NULL 
    GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id 
    ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;

Here is the output of EXPLAIN:

 ---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- --------------------------------------------------------------------- 
| id | select_type | table | partitions | type   | possible_keys                                  | key        | key_len | ref                        | rows    | filtered | Extra                                                               |
 ---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- --------------------------------------------------------------------- 
|  1 | SIMPLE      | pdel  | NULL       | range  | person_campaign_date,ip_address,date,emails_id | date       | 5       | NULL                       | 2333678 |    50.00 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | pe    | NULL       | eq_ref | PRIMARY                                        | PRIMARY    | 4       | subscriber.pdel.emails_id  |       1 |   100.00 | NULL                                                                |
|  1 | SIMPLE      | pc    | NULL       | ref    | ip_address                                     | ip_address | 18      | subscriber.pdel.ip_address |     128 |   100.00 | NULL                                                                |
 ---- ------------- ------- ------------ -------- ------------------------------------------------ ------------ --------- ---------------------------- --------- ---------- ---------------------------------------------------------------------

I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.

I know I am missing something here, either an index or using a function that is causing it to be slow, but from everything I have read I haven't figured it out yet.

CodePudding user response：

Too many questions vs comments. It appears for date criteria you are only pulling for a SINGLE date. Is this always the case?, or just this sample. Your pdel.date. Is it a date or date/time as stored. Your query is doing >= '2022-03-11' AND <= '2022-03-12'. Is this because your are trying to get up to and including 2022-03-11 at 11:59:59pm? And if so, should it be LESS than 03-12?

If your counts are based on a single day basis, and this data is rather fixed... that is you are not going to be changing deliveries, etc. on a day that has already passed. This might be a candidate condition for having a stored aggregate table that is done on a daily basis. This way when you are looking for activity patterns, you can have the non-changing aggregates already done and just go against that. Then if you need the details, go back to the raw data.

CodePudding user response：

These indexes are "covering", which should help some:

pdel:  INDEX(date, ip_address, sending_campaigns_id,  emails_id)
pc:  INDEX(ip_address,  campaigns_id, emails_id)

Assuming date is a DATETIME, this contains an extra midnight:

    WHERE  pdel.date >= '2022-03-11'
      AND  pdel.date <= '2022-03-12'

I prefer the pattern:

    WHERE  pdel.date >= '2022-03-11'
      AND  pdel.date  < '2022-03-11'   INTERVAL 1 DAY

When the GROUP BY and ORDER BY are different, an extra sort is (usually) required. So, write the GROUP BY to be just like the ORDER BY (after removing "ASC").

A minor simplification (and speedup):

   COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,

-->

   COUNT(DISTINCT, pdel.emails_id, pdel.date) AS deliveries,

Consider storing the numeric version of the IPv4 in INT UNSIGNED (only 4 bytes) instead of a VARCHAR. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA in the SELECT.

The COUNT(CASE ... ) can be simplified to

SUM(    pdel.ip_address = pc.ip_address
   AND  pdel.sending_campaigns_id = pc.campaigns_id
   AND  pdel.emails_id = pc.emails_id ) AS complaints

(substring_index(pe.email, '@', -1)) AS recipient_domain,

I think it should be 1, not -1 or the alias is 'wrong'.

Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL to equivalent, but simpler JOIN pe without the null test.

Sorry, but those will not provide a huge performance improvement. The next step would be to build and maintain a Summary Tables and use that to generate the desired 'report'. (See DRapp's Answer.)