Optimizing Requests parsing a 50 mil rows MySQL table-CodePudding

I have a request that I need to process multiple time during specific time frames, my implementation is working, but my user base is growing everyday and the CPU load of my database and the time taken to execute the query is getting bigger everyday

Here is the request:

SELECT bill.* FROM billing bill
            INNER JOIN subscriber s ON (s.subscriber_id = bill.subscriber_id) 
            INNER JOIN subscription sub ON(s.subscriber_id = sub.subscriber_id)
            WHERE s.status = 'C' 
            AND bill.subscription_id = sub.subscription_id                      
            AND sub.renewable = 1
            AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
            AND sub.store = 'BizaoStore'
            AND (sub.purchase_token = 'myservice' or sub.purchase_token = 'myservice_wait' ) 
            AND bill.billing_date > '2022-12-31 07:00:00' AND bill.billing_date < '2023-01-01 10:00:00'
            AND (bill.billing_value = 'not_ok bizao_tobe' or bill.billing_value =  'not_ok BILL010 2' or bill.billing_value =  'not_ok BILL010' or bill.billing_value = 'not_ok BILL010 3')
            AND (SELECT MAX(bill2.billing_date)
                FROM billing bill2
                WHERE bill2.subscriber_id = bill.subscriber_id
                AND bill2.subscription_id = bill.subscription_id 
                AND bill2.billing_value = 'not_ok bizao_tobe') 
            = bill.billing_date order by sub.created_at DESC LIMIT 300;

This request is executed in two different servers, each serve handle a specific service. In each server, the request runs 8 times per minut (for around 3 hours) and each of the 8 times has this line with different hours:

AND (hour(sub.created_at) >  1 AND hour(sub.created_at) < 5 )

I did this so I can split my user base in 8 and process the requests more efficiently. Also I need to only handle 300 users at a time because the third party server I have to call for each user is not very stable and can sometime take very long to respond

The billing tables counts around 50.000.000 entries, here is the schema of the columns and indexes:

Subscriber table is around 2.000.000, columns scheme and indexes:

And finally subscription table, 2.500.000 rows, scheme and indexes:

As a few more infos, I noticed during my tests for optimisation that if I add in my request the fact that I want datas with a "billing_id" over a specific ID, it will run very fast. Basically I think whats taking the most time is parsing the 50.000.000 lines tables.

I did (or at least I tried) to optimize my request with time to be more efficient, but as of now I'm a little bit stuck with it.

Mysql version is 5.7.38

Thanks for your help

CodePudding user response：

I think there could be some rooms for improvement, some are easier to do some takes more efforts:

You mentioned you tried to improve the performance with limiting the time, which I assume mean AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 ), however sub.created_at does not have an index. It is a good idea created_at and updated_at columns always have indexes on in the database.
Additionally you can add an index on bill.billing_date
There is a sub-query finding MAX(bill2.billing_date). This can be a bottleneck too, as this subquery needs to run for each row of the previous stage of execution (you can check that with explain query). Depending on your condition you can store that max field somewhere, like a column in database, cache, etc. and keep updating it every time the underlying table's values change. You can do it with database triggers (which some people consider it anti pattern) or use the transaction inside your code.
You can run the query against a replica and do not care if it takes a lot of time.

Having all said, the first point might improve the performance a lot, and hence making the rest not necessary

CodePudding user response：

I see a few opportunities to speed up this query. (Reference: Markus Winand's https://use-the-index-luke.com/ e-book.)

Replace your correlated subquery (SELECT MAX(bill2.billing_date)...) with an independent subquery.
Try to make all your WHERE clauses sargable -- able to exploit indexes.
Add appropriate indexes.

Independent subquery Obtain latest billing date for each subscriber / subscription like this. This query need only run once, whereas the correlated subquery you have runs many times.

      SELECT MAX(billing_date) billing_date,
             subscriber_id,
             subscription_id
        FROM billing
       WHERE billing_value = 'not_ok bizao_tobe'
       GROUP BY subscriber_id, subscription_id

Use this index to speed up the subquery. This index allows the subquery to be satisfied with an almost miraculously fast loose index scan.

CREATE INDEX value_subscriber_subscription_date ON billing
   (billing_value, subscriber_id, subscription_id, billing_date DESC);

Then rewrite your overall query like this to use it. I have rewritten a few other things here as well to improve readability: mostly changing col = a OR col = b OR col = c to col IN (a,b,c). I also changed the order of some WHERE clauses, again for readability. The order of WHERE clauses doesn't matter to performance.

SELECT bill.* 
  FROM billing bill
  JOIN subscriber s ON s.subscriber_id = bill.subscriber_id 
  JOIN subscription sub   ON s.subscriber_id = sub.subscriber_id
                         AND bill.subscription_id = sub.subscription_id
  JOIN (
      SELECT MAX(billing_date) billing_date,
             subscriber_id,
             subscription_id
        FROM billing
       WHERE billing_value = 'not_ok bizao_tobe'
       GROUP BY subscriber_id, subscription_id
       ) latest   ON bill.subscriber_id = latest.subscriber_id
                 AND bill.subscription_id = latest.subscription_id 
                 AND bill.billing_date = latest.billing_date                 
 WHERE s.status = 'C' 
   AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
   AND sub.renewable = 1
   AND sub.store = 'BizaoStore'
   AND sub.purchase_token IN ('myservice', 'myservice_wait' ) 
   AND bill.billing_value IN (
      'not_ok bizao_tobe', 'not_ok BILL010 2', 
      'not_ok BILL010', 'not_ok BILL010 3')
   AND bill.billing_date > '2022-12-31 07:00:00' 
   AND bill.billing_date < '2023-01-01 10:00:00'
 ORDER BY sub.created_at DESC
 LIMIT 300;

Sargability Your segmenting of your user base by hours of the day means you need this clause, as you pointed out.

AND (HOUR(sub.created_at) >  1 AND HOUR(sub.created_at) < 5 )

This clause applies the HOUR() function to every eligible row so it has to scan through the rows looking at them all. Slow. Add a virtual column called created_hour to your subscription table. We'll put an index on the column in a moment.

ALTER TABLE subscription 
     ADD COLUMN created_hour TINYINT 
     GENERATED ALWAYS AS (HOUR(created_at)) VIRTUAL;

Then start using the virtual column to segment your users.

AND (sub.created_hour >  1 AND sub.created_hour < 5)

Indexes Compound (multicolumn) indexes are the way to speed up complex queries like yours. The order of columns matters in indexes. The columns with equality matches go first, then a column with a range match.

First, let's put a compound index on your subscription table that matches the requirements of your query. We'll index our new virtual column in the process. This lets the query planner find your batches efficiently.

The last column in this index is created_at. That speeds up your ORDER BY ... LIMIT operation.

CREATE INDEX renewable_store_token_hour_created 
     ON subscription (renewable, store, purchase_token,
                      created_hour, created_at);

Next, let's look at how you use billing in your main query. (We already added an index to help the subquery). You match for equality on billing_value and then by a date range on billing_date. So you need this index.

CREATE INDEX value_date ON billing (billing_value, billing_date);

You already have the index you need on subscriber.