Selecting Distinct Consecutive Values of Timeseries-CodePudding

I have a table in snowflake/dbt, where I want to select distinct sequential entries in the rows. For example: If I have

user_id	session_id	action	timestamp
2	3	scroll	21-08-01 12:00:01
2	3	scroll	21-08-01 12:00:02
2	3	scroll	21-08-01 12:00:03
2	3	click	21-08-01 12:00:04
2	3	click	21-08-01 12:00:06
2	3	scroll	21-08-01 12:00:10
2	3	saved	21-08-01 12:00:10

I want to have this at the end

user_id	session_id	action	timestamp
2	3	scroll	21-08-01 12:00:03
2	3	click	21-08-01 12:00:06
2	3	scroll	21-08-01 12:00:10
2	3	saved	21-08-01 12:00:10

I tried to use row_number() and next qualify but that will numerate all the action sequentially even if they are not.

CodePudding user response：

You may try the following which groups the closest occurring actions and chooses the most recent occurrence in the order that they appear.

SELECT
    user_id,
    session_id,
    action,
    timestamp
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
             PARTITION BY user_id,session_id,action,gn
             ORDER BY timestamp DESC
        ) as rn
    FROM (
        SELECT
            *,
            SUM(continued) OVER (ORDER BY timestamp) as gn
        FROM (
            SELECT
                *,
                CASE 
                    WHEN
                        LAG(
                            CONCAT(user_id,session_id,action),
                            1,
                            CONCAT(user_id,session_id,action)
                        ) OVER (
                            ORDER BY timestamp
                        ) = CONCAT(user_id,session_id,action) THEN 0
                    ELSE 1
                END as continued
            FROM
                my_table
        ) t2
    ) t1
) t
WHERE rn=1

Let me know if this works for you

CodePudding user response：

I tried something a little bit different than ggordon, building an inline view with the contents of the "next" record (using the LEAD function).

select user_id, session_id, action, ts
from (
  select abc.*, 
         lead(user_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_user_id, 
         lead(session_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_session_id, 
         lead(action) ignore nulls 
           over (order by ts, user_id, session_id, action) next_action, 
         lead(ts) ignore nulls 
           over (order by ts, user_id, session_id, action) next_ts
  from   abc 
  order by ts, user_id, session_id, action)
where user_id = NVL(next_user_id, user_id)
and   session_id = NVL(next_session_id, session_id)
and   action <> NVL(next_action, 'x')
order by ts, user_id, session_id, action;

This worked well and I was able to get the same four records you wanted.

I hope this helps...Rich

p.s. If this (or another) answer helps you, please take a moment to "accept" the answer that helped by clicking on the check mark beside the answer to toggle it from "greyed out" to "filled in".

CodePudding user response：

This is called a gaps and islands problem. This is usually solved by creating group keys via two concurrent row numberings.

select
  user_id, session_id, action, max(timestamp)
from
(
  select
    user_id, session_id, action, timestamp,
    row_number() over (order by timestamp, user_id, session_id, action) -
    row_number() over (partition by user_id, session_id, action order by timestamp)
      as grp
  from mytable
)
group by grp, user_id, session_id, action
order by grp, user_id, session_id, action;