I have a table in snowflake/dbt, where I want to select distinct sequential entries in the rows. For example: If I have
user_id | session_id | action | timestamp |
---|---|---|---|
2 | 3 | scroll | 21-08-01 12:00:01 |
2 | 3 | scroll | 21-08-01 12:00:02 |
2 | 3 | scroll | 21-08-01 12:00:03 |
2 | 3 | click | 21-08-01 12:00:04 |
2 | 3 | click | 21-08-01 12:00:06 |
2 | 3 | scroll | 21-08-01 12:00:10 |
2 | 3 | saved | 21-08-01 12:00:10 |
I want to have this at the end
user_id | session_id | action | timestamp |
---|---|---|---|
2 | 3 | scroll | 21-08-01 12:00:03 |
2 | 3 | click | 21-08-01 12:00:06 |
2 | 3 | scroll | 21-08-01 12:00:10 |
2 | 3 | saved | 21-08-01 12:00:10 |
I tried to use row_number() and next qualify but that will numerate all the action sequentially even if they are not.
CodePudding user response:
You may try the following which groups the closest occurring actions and chooses the most recent occurrence in the order that they appear.
SELECT
user_id,
session_id,
action,
timestamp
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id,session_id,action,gn
ORDER BY timestamp DESC
) as rn
FROM (
SELECT
*,
SUM(continued) OVER (ORDER BY timestamp) as gn
FROM (
SELECT
*,
CASE
WHEN
LAG(
CONCAT(user_id,session_id,action),
1,
CONCAT(user_id,session_id,action)
) OVER (
ORDER BY timestamp
) = CONCAT(user_id,session_id,action) THEN 0
ELSE 1
END as continued
FROM
my_table
) t2
) t1
) t
WHERE rn=1
Let me know if this works for you
CodePudding user response:
I tried something a little bit different than ggordon, building an inline view with the contents of the "next" record (using the LEAD function).
select user_id, session_id, action, ts
from (
select abc.*,
lead(user_id) ignore nulls
over (order by ts, user_id, session_id, action) next_user_id,
lead(session_id) ignore nulls
over (order by ts, user_id, session_id, action) next_session_id,
lead(action) ignore nulls
over (order by ts, user_id, session_id, action) next_action,
lead(ts) ignore nulls
over (order by ts, user_id, session_id, action) next_ts
from abc
order by ts, user_id, session_id, action)
where user_id = NVL(next_user_id, user_id)
and session_id = NVL(next_session_id, session_id)
and action <> NVL(next_action, 'x')
order by ts, user_id, session_id, action;
This worked well and I was able to get the same four records you wanted.
I hope this helps...Rich
p.s. If this (or another) answer helps you, please take a moment to "accept" the answer that helped by clicking on the check mark beside the answer to toggle it from "greyed out" to "filled in".
CodePudding user response:
This is called a gaps and islands problem. This is usually solved by creating group keys via two concurrent row numberings.
select
user_id, session_id, action, max(timestamp)
from
(
select
user_id, session_id, action, timestamp,
row_number() over (order by timestamp, user_id, session_id, action) -
row_number() over (partition by user_id, session_id, action order by timestamp)
as grp
from mytable
)
group by grp, user_id, session_id, action
order by grp, user_id, session_id, action;