Home > Blockchain >  Selecting Distinct Consecutive Values of Timeseries
Selecting Distinct Consecutive Values of Timeseries

Time:10-21

I have a table in snowflake/dbt, where I want to select distinct sequential entries in the rows. For example: If I have

user_id session_id action timestamp
2 3 scroll 21-08-01 12:00:01
2 3 scroll 21-08-01 12:00:02
2 3 scroll 21-08-01 12:00:03
2 3 click 21-08-01 12:00:04
2 3 click 21-08-01 12:00:06
2 3 scroll 21-08-01 12:00:10
2 3 saved 21-08-01 12:00:10

I want to have this at the end

user_id session_id action timestamp
2 3 scroll 21-08-01 12:00:03
2 3 click 21-08-01 12:00:06
2 3 scroll 21-08-01 12:00:10
2 3 saved 21-08-01 12:00:10

I tried to use row_number() and next qualify but that will numerate all the action sequentially even if they are not.

CodePudding user response:

You may try the following which groups the closest occurring actions and chooses the most recent occurrence in the order that they appear.

SELECT
    user_id,
    session_id,
    action,
    timestamp
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
             PARTITION BY user_id,session_id,action,gn
             ORDER BY timestamp DESC
        ) as rn
    FROM (
        SELECT
            *,
            SUM(continued) OVER (ORDER BY timestamp) as gn
        FROM (
            SELECT
                *,
                CASE 
                    WHEN
                        LAG(
                            CONCAT(user_id,session_id,action),
                            1,
                            CONCAT(user_id,session_id,action)
                        ) OVER (
                            ORDER BY timestamp
                        ) = CONCAT(user_id,session_id,action) THEN 0
                    ELSE 1
                END as continued
            FROM
                my_table
        ) t2
    ) t1
) t
WHERE rn=1

Let me know if this works for you

CodePudding user response:

I tried something a little bit different than ggordon, building an inline view with the contents of the "next" record (using the LEAD function).

select user_id, session_id, action, ts
from (
  select abc.*, 
         lead(user_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_user_id, 
         lead(session_id) ignore nulls 
           over (order by ts, user_id, session_id, action) next_session_id, 
         lead(action) ignore nulls 
           over (order by ts, user_id, session_id, action) next_action, 
         lead(ts) ignore nulls 
           over (order by ts, user_id, session_id, action) next_ts
  from   abc 
  order by ts, user_id, session_id, action)
where user_id = NVL(next_user_id, user_id)
and   session_id = NVL(next_session_id, session_id)
and   action <> NVL(next_action, 'x')
order by ts, user_id, session_id, action;

This worked well and I was able to get the same four records you wanted.

I hope this helps...Rich

p.s. If this (or another) answer helps you, please take a moment to "accept" the answer that helped by clicking on the check mark beside the answer to toggle it from "greyed out" to "filled in".

CodePudding user response:

This is called a gaps and islands problem. This is usually solved by creating group keys via two concurrent row numberings.

select
  user_id, session_id, action, max(timestamp)
from
(
  select
    user_id, session_id, action, timestamp,
    row_number() over (order by timestamp, user_id, session_id, action) -
    row_number() over (partition by user_id, session_id, action order by timestamp)
      as grp
  from mytable
)
group by grp, user_id, session_id, action
order by grp, user_id, session_id, action;
  • Related