Home > Net >  historical aggregation of a column up until a specified time in each row in another column
historical aggregation of a column up until a specified time in each row in another column

Time:10-08

I have two tables login_attempts and checkouts in Amazon RedShift. A user can have multiple (un)successful login attempts and multiple (un)successful checkouts as shown in this example:

login_attempts

login_id | user_id  |       login           |   success
-------------------------------------------------------
1        |  1       |   2021-07-01 14:00:00 |   0
2        |  1       |   2021-07-01 16:00:00 |   1
3        |  2       |   2021-07-02 05:01:01 |   1
4        |  1       |   2021-07-04 03:25:34 |   0
5        |  2       |   2021-07-05 11:20:50 |   0
6        |  2       |   2021-07-07 12:34:56 |   1

and

checkouts

checkout_id |   checkout_time       | user_id   |   success
------------------------------------------------------------
1           |   2021-07-01 18:00:00 |   1       |   0
2           |   2021-07-02 06:54:32 |   2       |   1
3           |   2021-07-04 13:00:01 |   1       |   1
4           |   2021-07-08 09:05:00 |   2       |   1

Given this information, how can I get the following table with historical performance included for each checkout AS OF THAT TIME?

checkout_id | checkout              | user_id | lastGoodLogin       | lastFailedLogin     |  lastGoodCheckout   |  lastFailedCheckout |
---------------------------------------------------------------------------------------------------------------------------------------
1           | 2021-07-01 18:00:00   | 1       | 2021-07-01 16:00:00 | 2021-07-01 14:00:00 |       NULL          |     NULL
2           | 2021-07-02 06:54:32   | 2       | 2021-07-02 05:01:01 |       NULL          |       NULL          |     NULL
3           | 2021-07-04 13:00:01   | 1       | 2021-07-01 16:00:00 | 2021-07-04 03:25:34 |       NULL          | 2021-07-01 18:00:00
4           | 2021-07-08 09:05:00   | 2       | 2021-07-07 12:34:56 | 2021-07-05 11:20:50 | 2021-07-02 06:54:32 |     NULL

Update: I was able to get lastFailedCheckout & lastGoodCheckout because that's doing window operations on the same table (checkouts) but I am failing to understand how to best join it with login_attempts table to get last[Good|Failed]Login fields. (sqlfiddle)

P.S.: I am open to PostgreSQL suggestions as well.

CodePudding user response:

Good start! A couple things in your SQL - 1) You should really try to avoid inequality joins as these can lead to data explosions and aren't needed in this case. Just put a CASE statement inside your window function to use only the type of checkout (or login) you want. 2) You can use the frame clause to not self select the same row when finding previous checkouts.

Once you have this pattern you can use it to find the other 2 columns of data you are looking for. The first step is to UNION the tables together, not JOIN. This means making a few more columns so the data can live together but that is easy. Now you have the userid and the time the "thing" happened all in the same data. You just need to WINDOW 2 more times to pull the info you want. Lastly, you need to strip out the non-checkout rows with an outer select w/ where clause.

Like this:

create table login_attempts(
  loginid smallint,
  userid smallint,
  login timestamp,
  success smallint
);

create table checkouts(
  checkoutid smallint,
  userid smallint,
  checkout_time timestamp,
  success smallint
);


insert into login_attempts values
(1, 1, '2021-07-01 14:00:00', 0),
(2, 1, '2021-07-01 16:00:00', 1),
(3, 2, '2021-07-02 05:01:01', 1),
(4, 1, '2021-07-04 03:25:34', 0),
(5, 2, '2021-07-05 11:20:50', 0),
(6, 2, '2021-07-07 12:34:56', 1)
;

insert into checkouts values
(1, 1, '2021-07-01 18:00:00', 0),
(2, 2, '2021-07-02 06:54:32', 1),
(3, 1, '2021-07-04 13:00:01', 1),
(4, 2, '2021-07-08 09:05:00', 1)
;

SQL:

select * 
from (
 select 
  c.checkoutid,
  c.userid,
  c.checkout_time,
  
  max(case success when 0 then checkout_time end) over (
    partition by userid
    order by event_time
    rows between unbounded preceding and 1 preceding
  ) as lastFailedCheckout,  
  
  max(case success when 1 then checkout_time end) over (
    partition by userid
    order by event_time
    rows between unbounded preceding and 1 preceding
  ) as lastGoodCheckout,

  max(case lsuccess when 0 then login end) over (
    partition by userid
    order by event_time
    rows between unbounded preceding and 1 preceding
  ) as lastFailedLogin,  
  
  max(case lsuccess when 1 then login end) over (
    partition by userid
    order by event_time
    rows between unbounded preceding and 1 preceding
  ) as lastGoodLogin
  
 from (
  select checkout_time as event_time, checkoutid, userid, 
    checkout_time, success,
    NULL as login, NULL as lsuccess
  from checkouts
  UNION ALL
  select login as event_time,NULL as checkoutid, userid, 
    NULL as checkout_time, NULL as success,  
    login, success as lsuccess
  from login_attempts
 ) c
) o
where o.checkoutid is not null
order by o.checkoutid
  • Related