I have a request that I need to process multiple time during specific time frames, my implementation is working, but my user base is growing everyday and the CPU load of my database and the time taken to execute the query is getting bigger everyday
Here is the request:
SELECT bill.* FROM billing bill
INNER JOIN subscriber s ON (s.subscriber_id = bill.subscriber_id)
INNER JOIN subscription sub ON(s.subscriber_id = sub.subscriber_id)
WHERE s.status = 'C'
AND bill.subscription_id = sub.subscription_id
AND sub.renewable = 1
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
AND sub.store = 'BizaoStore'
AND (sub.purchase_token = 'myservice' or sub.purchase_token = 'myservice_wait' )
AND bill.billing_date > '2022-12-31 07:00:00' AND bill.billing_date < '2023-01-01 10:00:00'
AND (bill.billing_value = 'not_ok bizao_tobe' or bill.billing_value = 'not_ok BILL010 2' or bill.billing_value = 'not_ok BILL010' or bill.billing_value = 'not_ok BILL010 3')
AND (SELECT MAX(bill2.billing_date)
FROM billing bill2
WHERE bill2.subscriber_id = bill.subscriber_id
AND bill2.subscription_id = bill.subscription_id
AND bill2.billing_value = 'not_ok bizao_tobe')
= bill.billing_date order by sub.created_at DESC LIMIT 300;
This request is executed in two different servers, each serve handle a specific service. In each server, the request runs 8 times per minut (for around 3 hours) and each of the 8 times has this line with different hours:
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
I did this so I can split my user base in 8 and process the requests more efficiently. Also I need to only handle 300 users at a time because the third party server I have to call for each user is not very stable and can sometime take very long to respond
The billing tables counts around 50.000.000 entries, here is the schema of the columns and indexes:
Subscriber table is around 2.000.000, columns scheme and indexes:
And finally subscription table, 2.500.000 rows, scheme and indexes:
As a few more infos, I noticed during my tests for optimisation that if I add in my request the fact that I want datas with a "billing_id" over a specific ID, it will run very fast. Basically I think whats taking the most time is parsing the 50.000.000 lines tables.
I did (or at least I tried) to optimize my request with time to be more efficient, but as of now I'm a little bit stuck with it.
Mysql version is 5.7.38
Thanks for your help
CodePudding user response:
I think there could be some rooms for improvement, some are easier to do some takes more efforts:
- You mentioned you tried to improve the performance with limiting the time, which I assume mean
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
, howeversub.created_at
does not have an index. It is a good ideacreated_at
andupdated_at
columns always have indexes on in the database. - Additionally you can add an index on
bill.billing_date
- There is a sub-query finding
MAX(bill2.billing_date)
. This can be a bottleneck too, as this subquery needs to run for each row of the previous stage of execution (you can check that withexplain
query). Depending on your condition you can store that max field somewhere, like a column in database, cache, etc. and keep updating it every time the underlying table's values change. You can do it with database triggers (which some people consider it anti pattern) or use the transaction inside your code. - You can run the query against a replica and do not care if it takes a lot of time.
Having all said, the first point might improve the performance a lot, and hence making the rest not necessary
CodePudding user response:
I see a few opportunities to speed up this query. (Reference: Markus Winand's https://use-the-index-luke.com/ e-book.)
- Replace your correlated subquery (
SELECT MAX(bill2.billing_date)...
) with an independent subquery. - Try to make all your
WHERE
clauses sargable -- able to exploit indexes. - Add appropriate indexes.
Independent subquery Obtain latest billing date for each subscriber / subscription like this. This query need only run once, whereas the correlated subquery you have runs many times.
SELECT MAX(billing_date) billing_date,
subscriber_id,
subscription_id
FROM billing
WHERE billing_value = 'not_ok bizao_tobe'
GROUP BY subscriber_id, subscription_id
Use this index to speed up the subquery. This index allows the subquery to be satisfied with an almost miraculously fast loose index scan.
CREATE INDEX value_subscriber_subscription_date ON billing
(billing_value, subscriber_id, subscription_id, billing_date DESC);
Then rewrite your overall query like this to use it. I have rewritten a few other things here as well to improve readability: mostly changing col = a OR col = b OR col = c
to col IN (a,b,c)
. I also changed the order of some WHERE clauses, again for readability. The order of WHERE clauses doesn't matter to performance.
SELECT bill.*
FROM billing bill
JOIN subscriber s ON s.subscriber_id = bill.subscriber_id
JOIN subscription sub ON s.subscriber_id = sub.subscriber_id
AND bill.subscription_id = sub.subscription_id
JOIN (
SELECT MAX(billing_date) billing_date,
subscriber_id,
subscription_id
FROM billing
WHERE billing_value = 'not_ok bizao_tobe'
GROUP BY subscriber_id, subscription_id
) latest ON bill.subscriber_id = latest.subscriber_id
AND bill.subscription_id = latest.subscription_id
AND bill.billing_date = latest.billing_date
WHERE s.status = 'C'
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5 )
AND sub.renewable = 1
AND sub.store = 'BizaoStore'
AND sub.purchase_token IN ('myservice', 'myservice_wait' )
AND bill.billing_value IN (
'not_ok bizao_tobe', 'not_ok BILL010 2',
'not_ok BILL010', 'not_ok BILL010 3')
AND bill.billing_date > '2022-12-31 07:00:00'
AND bill.billing_date < '2023-01-01 10:00:00'
ORDER BY sub.created_at DESC
LIMIT 300;
Sargability Your segmenting of your user base by hours of the day means you need this clause, as you pointed out.
AND (HOUR(sub.created_at) > 1 AND HOUR(sub.created_at) < 5 )
This clause applies the HOUR() function to every eligible row so it has to scan through the rows looking at them all. Slow. Add a virtual column called created_hour
to your subscription
table. We'll put an index on the column in a moment.
ALTER TABLE subscription
ADD COLUMN created_hour TINYINT
GENERATED ALWAYS AS (HOUR(created_at)) VIRTUAL;
Then start using the virtual column to segment your users.
AND (sub.created_hour > 1 AND sub.created_hour < 5)
Indexes Compound (multicolumn) indexes are the way to speed up complex queries like yours. The order of columns matters in indexes. The columns with equality matches go first, then a column with a range match.
First, let's put a compound index on your subscription
table that matches the requirements of your query. We'll index our new virtual column in the process. This lets the query planner find your batches efficiently.
The last column in this index is created_at
. That speeds up your ORDER BY ... LIMIT
operation.
CREATE INDEX renewable_store_token_hour_created
ON subscription (renewable, store, purchase_token,
created_hour, created_at);
Next, let's look at how you use billing
in your main query. (We already added an index to help the subquery). You match for equality on billing_value
and then by a date range on billing_date
. So you need this index.
CREATE INDEX value_date ON billing (billing_value, billing_date);
You already have the index you need on subscriber
.