I am trying to join two data sets together, while at the same time aggregating data from the joining table based on timestamp values from the main table. Essentially, I have two data sets, the first one looks like this:
Table_1
user id | change | timestamp
-------------------------------
user_1 initiate 2020-01-01
user_1 change 2020-03-01
user_1 change 2021-01-01
user_1 end 2021-06-01
user_2 change 2022-03-01
user_2 renew 2022-06-01
The second looks like this:
Table_2
user_id | action | action_timestamp
-----------------------------------
user_1 xyz 2020-02-01
user_1 zyx 2020-02-03
user_1 123 2021-01-02
user_1 234 2021-05-01
user_2 xyz 2022-03-02
user_2 abc 2022-04-01
user_2 234 2022-05-30
I want to get to a table that looks like this:
user id | change | timestamp | actions_since_last_change
----------------------------------------------------
user_1 initiate 2020-01-01 null
user_1 change 2020-03-01 xyz, zyx
user_1 change 2021-01-01 null
user_1 end 2021-06-01 123, 234
user_2 change 2022-03-01 null
user_2 renew 2022-06-01 abc, 234
I think its (maybe?) something (pseudo) like this:
on Table_1.user_id = Table_2.user_id and Table_2.timestamp between rows preceeding (partition by user_id)
It also might be in the Select statement of the main query that generates Table_1, like:
Select user_id, change, timestamp, string_agg(Select(action from Table_2),", ")
From Initial_table
Honestly, I'm not sure what the right approach to this is, so any advice would be appreciated!
CodePudding user response:
There is only window function and some grouping needed.
First we generate tbl1
and tbl2
with the data you provided. Next we union
both tables to helper
. We add a dummy
column to indicate from which table, 1 or 2, each rows comes from. In helper2
table we only keep the timestamp
from table 1 (dummy=1
) and fill out all rows beneath with the same value. So we mapped to all rows of table 2 the starting point of table 1 and call this new column change_time
. By grouping by this column and the user_id
were done. The action
column needs only be aggregated to a string or to an array.
with tbl1 as (
Select "user_1" user_id,"initiate" change ,date("2020-01-01") timestamp
Union all Select "user_1","change",date("2020-03-01")
Union all Select "user_1","change",date("2021-01-01")
Union all Select "user_1","end",date("2021-06-01")
Union all Select "user_2","change",date("2022-03-01")
Union all Select "user_2","renew",date("2022-06-01")
),
tbl2 as (
Select "user_1" user_id,"xyz" action ,date("2020-02-01") action_timestamp
Union all Select "user_1","zyx",date("2020-02-03")
Union all Select "user_1","123",date("2021-01-02")
Union all Select "user_1","234",date("2021-05-01")
Union all Select "user_2","xyz",date("2022-03-02")
Union all Select "user_2","abc",date("2022-04-01")
Union all Select "user_2","234",date("2022-05-30")
),
helper as (
select 1 as dummy,user_id,timestamp,change,null as action from tbl1
union all select 2,user_id,action_timestamp,null,action from tbl2
),
helper2 as (
Select *,
last_value(if(dummy=1,timestamp,null) ignore nulls) over win1 as change_time
from helper
window win1 as (partition by user_id order by timestamp rows between unbounded preceding and current row)
#order by user_id,timestamp
)
Select user_id,max(change),change_time,
#array_agg(action ignore nulls order by timestamp) as array_actions_since_last_change ,
string_agg(action order by timestamp) actions_since_last_change,
from helper2
group by 1,3
CodePudding user response:
This is best done in two stages.
First, we need
WITH cte AS (
SELECT
user_id,
change,
LAG(timestamp) OVER(user_window) AS prev_timestamp,
timestamp,
FROM
table_1
WINDOW
user_window AS (
PARTITION BY user_id
ORDER BY timestamp
)
)
Now you have both prev_timestamp
and timestamp
to help you with your join.
Next, we need to do that join:
SELECT
cte.user_id,
cte.change,
cte.timestamp,
ARRAY_AGG(table_2.action) AS actions_since_last_change,
FROM
cte
LEFT JOIN
table_2
ON
table_2.user_id = cte.user_id
AND table_2.timestamp BETWEEN cte.prev_timestamp AND cte.timestamp
GROUP BY
1, 2, 3
Join conditions are pretty straightforward really: they are simply logical expressions, and they trigger a join for every pair of rows where the expression evaluates to 'true'.
The query just above will join each row of your first table (with the help of the additional column you calculated) to rows from the second table for the same user, where the action timestamp is earlier than the change and later (if applicable) than the preceding change.
In the same query, it will group the resulting rows back into the original form they were in when part of table_1
. The rows from the second table are dealt with by creating an array of the actions (which seems to me better than aggregating into a string).