Home > front end >  Join and aggregate data based on timestamp
Join and aggregate data based on timestamp

Time:12-01

I am trying to join two data sets together, while at the same time aggregating data from the joining table based on timestamp values from the main table. Essentially, I have two data sets, the first one looks like this:

Table_1
    user id  | change  | timestamp
    -------------------------------
     user_1    initiate  2020-01-01
     user_1    change    2020-03-01
     user_1    change    2021-01-01
     user_1    end       2021-06-01
     user_2    change    2022-03-01
     user_2    renew     2022-06-01

The second looks like this:

Table_2
   user_id | action | action_timestamp
   -----------------------------------
   user_1    xyz      2020-02-01  
   user_1    zyx      2020-02-03
   user_1    123      2021-01-02
   user_1    234      2021-05-01
   user_2    xyz      2022-03-02
   user_2    abc      2022-04-01
   user_2    234      2022-05-30

I want to get to a table that looks like this:

    user id  | change  | timestamp  | actions_since_last_change
    ----------------------------------------------------
     user_1    initiate  2020-01-01    null
     user_1    change    2020-03-01    xyz, zyx
     user_1    change    2021-01-01    null
     user_1    end       2021-06-01    123, 234
     user_2    change    2022-03-01    null
     user_2    renew     2022-06-01    abc, 234

I think its (maybe?) something (pseudo) like this:

on Table_1.user_id = Table_2.user_id and Table_2.timestamp between rows preceeding (partition by user_id)

It also might be in the Select statement of the main query that generates Table_1, like:

Select user_id, change, timestamp, string_agg(Select(action from Table_2),", ")
From Initial_table

Honestly, I'm not sure what the right approach to this is, so any advice would be appreciated!

CodePudding user response:

There is only window function and some grouping needed.

First we generate tbl1 and tbl2 with the data you provided. Next we union both tables to helper. We add a dummy column to indicate from which table, 1 or 2, each rows comes from. In helper2 table we only keep the timestamp from table 1 (dummy=1) and fill out all rows beneath with the same value. So we mapped to all rows of table 2 the starting point of table 1 and call this new column change_time. By grouping by this column and the user_id were done. The action column needs only be aggregated to a string or to an array.

with tbl1 as (
Select  "user_1" user_id,"initiate" change  ,date("2020-01-01") timestamp 
Union all Select    "user_1","change",date("2020-03-01")
Union all Select    "user_1","change",date("2021-01-01")
Union all Select    "user_1","end",date("2021-06-01")
Union all Select    "user_2","change",date("2022-03-01")
Union all Select    "user_2","renew",date("2022-06-01")
),
tbl2 as (
Select  "user_1" user_id,"xyz" action ,date("2020-02-01") action_timestamp
Union all Select    "user_1","zyx",date("2020-02-03")
Union all Select    "user_1","123",date("2021-01-02")
Union all Select    "user_1","234",date("2021-05-01")
Union all Select    "user_2","xyz",date("2022-03-02")
Union all Select    "user_2","abc",date("2022-04-01")
Union all Select    "user_2","234",date("2022-05-30")
),
helper as (
  select 1 as dummy,user_id,timestamp,change,null as action from tbl1 
  union all select 2,user_id,action_timestamp,null,action from tbl2
  ),
helper2 as (
Select *,
last_value(if(dummy=1,timestamp,null) ignore nulls) over win1 as change_time
from helper
window win1 as (partition by user_id order by timestamp rows between unbounded preceding and current row)
#order by user_id,timestamp
)
Select user_id,max(change),change_time,
#array_agg(action ignore nulls order by timestamp) as array_actions_since_last_change ,
string_agg(action  order by timestamp) actions_since_last_change,
from helper2
group by 1,3

This yields the result: enter image description here

CodePudding user response:

This is best done in two stages.

First, we need

WITH cte AS (
    SELECT
        user_id,
        change,
        LAG(timestamp) OVER(user_window) AS prev_timestamp,
        timestamp,
      FROM
        table_1
    WINDOW
        user_window AS (
          PARTITION BY user_id
          ORDER BY timestamp
        )
)

Now you have both prev_timestamp and timestamp to help you with your join.

Next, we need to do that join:

    SELECT
        cte.user_id,
        cte.change,
        cte.timestamp,
        ARRAY_AGG(table_2.action) AS actions_since_last_change,
      FROM
        cte
 LEFT JOIN
        table_2
        ON
           table_2.user_id = cte.user_id
       AND table_2.timestamp BETWEEN cte.prev_timestamp AND cte.timestamp
  GROUP BY
        1, 2, 3

Join conditions are pretty straightforward really: they are simply logical expressions, and they trigger a join for every pair of rows where the expression evaluates to 'true'.

The query just above will join each row of your first table (with the help of the additional column you calculated) to rows from the second table for the same user, where the action timestamp is earlier than the change and later (if applicable) than the preceding change.

In the same query, it will group the resulting rows back into the original form they were in when part of table_1. The rows from the second table are dealt with by creating an array of the actions (which seems to me better than aggregating into a string).

  • Related