Access "real" previous value with SQL lag having duplicate entries for same key-CodePudding

How would it be possible to a get previous timestamp (or any other key for that matter) when there are multiple entries for the same timestamp in SQL (presto).

This is an ideal example outcome

user	timestamp	category	previous timestamp
Mary	1000	a
Mary	1050	b	1000
Mary	1050	a	1000
Mary	1050	c	1000

however if using lag(timestamp,1) over (partition by user order by timestamp) as previous_timestamp you get the following

user	timestamp	category	previous timestamp
Mary	1000	a
Mary	1050	b	1000
Mary	1050	a	1050
Mary	1050	c	1050

which makes sense given that lag() uses the previous row but it's not the expected outcome. Is there any way to make this work?

CodePudding user response：

Syntax for SQL Server, but the logic should work.

declare @some_table table (
  [user] varchar(5),
  [timestamp] int,
  [category] varchar(5)
)

insert @some_table
values 
  ('Mary', 1000, 'a')
, ('Mary', 1050, 'b')
, ('Mary', 1050, 'a')
, ('Mary', 1050, 'c')

select *
from @some_table

;
with a1 as (
  select distinct [user]
  , [timestamp]
  from @some_table
),
a2 as (
  select distinct [user]
  , [timestamp]
  , lag([timestamp],1) over (partition by [user] order by [timestamp]) as previous_timestamp
  from a1
)

select a.[user]
, a.[timestamp]
, b.category
, a.previous_timestamp

from (
    select [user]
    , [timestamp]
    , lag([timestamp],1) over (partition by [user] order by [timestamp]) as previous_timestamp
    from a1
  ) a 
  join @some_table b on a.[user] = b.[user] 
                 and a.[timestamp] = b.[timestamp];

CodePudding user response：

In Trino version you can use improved support for frame type RANGE (which allows to use previous rows that are at least 1 less then current ones):

-- sample data
WITH dataset(user, timestamp, category) AS (
 values ('Mary', 1000, 'a'),
    ('Mary', 1050, 'b'),
    ('Mary', 1050, 'a'),
    ('Mary', 1050, 'c')
)

-- query
select user,
    timestamp,
    max(timestamp) over(partition by user order by timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
from dataset;

Output:

user	timestamp	_col2
Mary	1000	NULL
Mary	1050	1000
Mary	1050	1000
Mary	1050	1000

If this feature is not available in Presto version running by Athena, then you can use approach a bit derived from gaps-and-islands but instead of creating groups and summing values you will set "lag" value to null if it is equal to current one and then use max which will ignore nulls:

-- sample data
WITH dataset(user, timestamp, category) AS (
 values ('Mary', 1000, 'a'),
    ('Mary', 1050, 'b'),
    ('Mary', 1050, 'a'),
    ('Mary', 1050, 'c'),
    ('Mary', 1051, 'c'),
    ('Mary', 1051, 'c'),
    ('Mary', 1052, 'c')
),
-- query parts
with_lag as (
    select user,
        timestamp,
        lag(timestamp) over(partition by user order by timestamp) lag_t -- "ordinary" lag
    from dataset
),
with_lag_nulls as (
    select *,
       if(lag_t is null, timestamp, if(lag_t != timestamp, lag_t)) actual_for_lag -- if lag is null use current, if current is not equal to lag, use lag
    from with_lag
)

select user,
       timestamp,
       if(
           lag_t is null, 
           null, -- null for first row in partition
           max(actual_for_lag) over(partition by user order by timestamp)) -- then use max "adjusted" lag  
from with_lag_nulls;

Output:

user	timestamp	_col2
Mary	1000	NULL
Mary	1050	1000
Mary	1050	1000
Mary	1050	1000
Mary	1051	1050
Mary	1051	1050
Mary	1052	1051