How would it be possible to a get previous timestamp (or any other key for that matter) when there are multiple entries for the same timestamp in SQL (presto).
This is an ideal example outcome
user | timestamp | category | previous timestamp |
---|---|---|---|
Mary | 1000 | a | |
Mary | 1050 | b | 1000 |
Mary | 1050 | a | 1000 |
Mary | 1050 | c | 1000 |
however if using lag(timestamp,1) over (partition by user order by timestamp) as previous_timestamp
you get the following
user | timestamp | category | previous timestamp |
---|---|---|---|
Mary | 1000 | a | |
Mary | 1050 | b | 1000 |
Mary | 1050 | a | 1050 |
Mary | 1050 | c | 1050 |
which makes sense given that lag()
uses the previous row but it's not the expected outcome. Is there any way to make this work?
CodePudding user response:
Syntax for SQL Server, but the logic should work.
declare @some_table table (
[user] varchar(5),
[timestamp] int,
[category] varchar(5)
)
insert @some_table
values
('Mary', 1000, 'a')
, ('Mary', 1050, 'b')
, ('Mary', 1050, 'a')
, ('Mary', 1050, 'c')
select *
from @some_table
;
with a1 as (
select distinct [user]
, [timestamp]
from @some_table
),
a2 as (
select distinct [user]
, [timestamp]
, lag([timestamp],1) over (partition by [user] order by [timestamp]) as previous_timestamp
from a1
)
select a.[user]
, a.[timestamp]
, b.category
, a.previous_timestamp
from (
select [user]
, [timestamp]
, lag([timestamp],1) over (partition by [user] order by [timestamp]) as previous_timestamp
from a1
) a
join @some_table b on a.[user] = b.[user]
and a.[timestamp] = b.[timestamp];
CodePudding user response:
In Trino version you can use improved support for frame type RANGE (which allows to use previous rows that are at least 1 less then current ones):
-- sample data
WITH dataset(user, timestamp, category) AS (
values ('Mary', 1000, 'a'),
('Mary', 1050, 'b'),
('Mary', 1050, 'a'),
('Mary', 1050, 'c')
)
-- query
select user,
timestamp,
max(timestamp) over(partition by user order by timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
from dataset;
Output:
user | timestamp | _col2 |
---|---|---|
Mary | 1000 | NULL |
Mary | 1050 | 1000 |
Mary | 1050 | 1000 |
Mary | 1050 | 1000 |
If this feature is not available in Presto version running by Athena, then you can use approach a bit derived from gaps-and-islands but instead of creating groups and summing values you will set "lag" value to null if it is equal to current one and then use max
which will ignore nulls:
-- sample data
WITH dataset(user, timestamp, category) AS (
values ('Mary', 1000, 'a'),
('Mary', 1050, 'b'),
('Mary', 1050, 'a'),
('Mary', 1050, 'c'),
('Mary', 1051, 'c'),
('Mary', 1051, 'c'),
('Mary', 1052, 'c')
),
-- query parts
with_lag as (
select user,
timestamp,
lag(timestamp) over(partition by user order by timestamp) lag_t -- "ordinary" lag
from dataset
),
with_lag_nulls as (
select *,
if(lag_t is null, timestamp, if(lag_t != timestamp, lag_t)) actual_for_lag -- if lag is null use current, if current is not equal to lag, use lag
from with_lag
)
select user,
timestamp,
if(
lag_t is null,
null, -- null for first row in partition
max(actual_for_lag) over(partition by user order by timestamp)) -- then use max "adjusted" lag
from with_lag_nulls;
Output:
user | timestamp | _col2 |
---|---|---|
Mary | 1000 | NULL |
Mary | 1050 | 1000 |
Mary | 1050 | 1000 |
Mary | 1050 | 1000 |
Mary | 1051 | 1050 |
Mary | 1051 | 1050 |
Mary | 1052 | 1051 |