Home > Software design >  Efficient way to get the average of past x events within d days per each row in SQL (big data)
Efficient way to get the average of past x events within d days per each row in SQL (big data)

Time:01-03

I want to find the best and most efficient way to calculate the average of a score from the past 2 events within 7 days, and I need it per each row. I already have a query that works on 60M rows, but on 100% (~500M rows) of the data its collapses (maybe not efficient or maybe lack of resources). can you help? If you think my solution is not the best way please explain. Thank you

I have this table:

user_id  event_id     start        end       score    
---------------------------------------------------
   1       7       30/01/2021   30/01/2021     45       
   1       6       24/01/2021   29/01/2021     25 
   1       5       22/01/2021   23/01/2021     13    
   1       4       18/01/2021   21/01/2021     15
   1       3       17/01/2021   17/01/2021     52 
   1       2       08/01/2021   10/01/2021     8    
   1       1       01/01/2021   02/01/2021     36

I want per line (user id event id): to get the average score of the past 2 events in the last 7 days.

Example: for this row:

user_id  event_id     start        end       score    
---------------------------------------------------
   1       6       24/01/2021   29/01/2021     25 


user_id  event_id     start        end       score past_7_days_from_start   event_num  
--------------------------------------------------------------------------------------
   1       6       24/01/2021   29/01/2021     25             null              null
   1       5       22/01/2021   23/01/2021     13              yes               1  
   1       4       18/01/2021   21/01/2021     15              yes               2  
   1       3       17/01/2021   17/01/2021     52              yes               3     
   1       2       08/01/2021   10/01/2021     8               no                4      
   1       1       01/01/2021   02/01/2021     36              no                5   

so I would select only this rows for the group by and then avg(score):

user_id  event_id     start        end       score past_7_days_from_start   event_num  
--------------------------------------------------------------------------------------
   1       5       22/01/2021   23/01/2021     13              yes               1  
   1       4       18/01/2021   21/01/2021     15              yes               2  

Result:

user_id  event_id   start      end     score avg_score_of_past_2_events_within_7_days   
--------------------------------------------------------------------------------------
   1       6    24/01/2021 29/01/2021   25                  14

My query:

SELECT user_id, event_id, AVG(score) as avg_score_of_past_2_events_within_7_days
FROM (
    SELECT 
        B.user_id, B.event_id, A.score,
        ROW_NUMBER() OVER (PARTITION BY B.user_id, B.event_id ORDER BY A.end desc) AS event_num,
    FROM
        "df" A
    INNER JOIN
        (SELECT user_id, event_id, start FROM "df") B 
            ON B.user_id =  FTP.user_id
            AND (A.end BETWEEN DATE_SUB(B.start, INTERVAL 7 DAY) AND B.start))
WHERE event_num >= 2
GROUP BY user_id, event_id

Any suggestion for a better way?

CodePudding user response:

I don't believe in your case, there is a more efficient query.

I can suggest you do the following:

  1. Make sure your base table is partition by start and cluster by user_id

  2. Split the query to 3 parts that creating partitioned and clustered tables:

  • first table: only the inner join O(n^2)
  • second table: add ROW_NUMBER O(n)
  • third table: group by
  1. If it is still a problem I would suggest doing batch preprocessing and run the queries by dates.
  • Related