I want to find the best and most efficient way to calculate the average of a score from the past 2 events within 7 days, and I need it per each row. I already have a query that works on 60M rows, but on 100% (~500M rows) of the data its collapses (maybe not efficient or maybe lack of resources). can you help? If you think my solution is not the best way please explain. Thank you
I have this table:
user_id event_id start end score
---------------------------------------------------
1 7 30/01/2021 30/01/2021 45
1 6 24/01/2021 29/01/2021 25
1 5 22/01/2021 23/01/2021 13
1 4 18/01/2021 21/01/2021 15
1 3 17/01/2021 17/01/2021 52
1 2 08/01/2021 10/01/2021 8
1 1 01/01/2021 02/01/2021 36
I want per line (user id event id): to get the average score of the past 2 events in the last 7 days.
Example: for this row:
user_id event_id start end score
---------------------------------------------------
1 6 24/01/2021 29/01/2021 25
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 null null
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
1 3 17/01/2021 17/01/2021 52 yes 3
1 2 08/01/2021 10/01/2021 8 no 4
1 1 01/01/2021 02/01/2021 36 no 5
so I would select only this rows for the group by and then avg(score):
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
Result:
user_id event_id start end score avg_score_of_past_2_events_within_7_days
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 14
My query:
SELECT user_id, event_id, AVG(score) as avg_score_of_past_2_events_within_7_days
FROM (
SELECT
B.user_id, B.event_id, A.score,
ROW_NUMBER() OVER (PARTITION BY B.user_id, B.event_id ORDER BY A.end desc) AS event_num,
FROM
"df" A
INNER JOIN
(SELECT user_id, event_id, start FROM "df") B
ON B.user_id = FTP.user_id
AND (A.end BETWEEN DATE_SUB(B.start, INTERVAL 7 DAY) AND B.start))
WHERE event_num >= 2
GROUP BY user_id, event_id
Any suggestion for a better way?
CodePudding user response:
I don't believe in your case, there is a more efficient query.
I can suggest you do the following:
Make sure your base table is partition by start and cluster by user_id
Split the query to 3 parts that creating partitioned and clustered tables:
- first table: only the inner join O(n^2)
- second table: add ROW_NUMBER O(n)
- third table: group by
- If it is still a problem I would suggest doing batch preprocessing and run the queries by dates.