Home > Back-end >  How to use variable lag window functions?
How to use variable lag window functions?

Time:12-21

I have a table with the following schema:

CREATE TABLE example (
    userID,
    status, --'SUCCESS' or 'FAIL'
    date -- self explanatory 
   
);
INSERT INTO example
Values(123, 'SUCCESS', 20211010),
(123, 'SUCCESS', 20211011),
(123, 'SUCCESS', 20211028),
(123, 'FAIL', 20211029),
(123, 'SUCCESS', 20211105),
(123, 'SUCCESS', 20211110)

I am trying to utilize a lag or lead function to assess whether the current line is within a 2-week window of the previous 'SUCCESS'. Given the current data, I would expect a isWithin2WeeksofSuccessFlag to be as following:

123, 'SUCCESS', 20211010,0 --since it is the first instance
123, 'SUCCESS', 20211011,1
123, 'SUCCESS', 20211028,1
123, 'FAIL', 20211029, 1 --failed, but criteria is that it is within 2 weeks of last success, which it is
123, 'SUCCESS', 20211105, 1 --last success is 2 rows back, but it is within 2 weeks
123, 'SUCCESS', 20211128, 0 --outside of 2 weeks

I would initially think to do something like this:

Select userID, status, date, 
case when lag(status,1) over (partition by userid order by date asc) = 'SUCCESS' 
and date_add('day',-14, date) <= lag(date,1) over (partition by userid order by date asc)
then 1 end as isWithin2WeeksofSuccessFlag
from example

This would work if I didn't have the 'FAIL' line in there. To handle it, I could modify the lag to 2 (instead of 1), but what about if I have 2,3,4,n 'FAIL's in a row? I would need to lag by 3,4,5,n 1. The specific number of FAILs in between is variable. How do I handle this variability?

NOTE I am querying billions of rows. Efficiency isn't really a concern (since it is for analysis), but running into memory allocation errors is.Thus, endlessly adding more window functions would likely cause an automatic termination of the query due memory requirement being above node limit.

How should I handle this?

CodePudding user response:

If your DBMS doesn't support window function filters you can order by status desc so 'SUCCESS' goes before 'FAIL'.

select userID, status, date, 
  case when lag(status,1) over (partition by userid order by status desc , date asc) = 'SUCCESS' 
        and dateadd(d, -14, date) <= lag(date,1) over (partition by userid order by status desc , date asc)
  then 1 end as isWithin2WeeksofSuccessFlag
from example
order by date

Sql Server fiddle

CodePudding user response:

Here's an approach, also using window functions, with each "common table expression" handling one step at a time.

Note: The expected result in the question does not match the data in the question. '20211128' doesn't exist in the actual data. I used the example INSERT statement.

In the test case, I changed the column name to xdate to avoid any potential SQL reserved word issues.

The SQL:

WITH cte1 AS (
         SELECT *
              , SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) OVER (PARTITION BY userID ORDER BY xdate) AS grp
           FROM example
     )
   , cte2 AS (
         SELECT *
              , MAX(CASE WHEN status = 'SUCCESS' THEN xdate END) OVER (PARTITION BY userID, grp) AS lastdate
           FROM cte1
     )
   , cte3 AS (
         SELECT *
              , CASE WHEN LAG(lastdate) OVER (PARTITION BY userID ORDER BY xdate) > (xdate - INTERVAL '2' WEEK) THEN 1 ELSE 0 END AS isNear
           FROM cte2
     )
SELECT * FROM cte3
 ORDER BY userID, xdate
;

The result:

 -------- --------- ------------ ------ ------------ -------- 
| userID | status  | xdate      | grp  | lastdate   | isNear |
 -------- --------- ------------ ------ ------------ -------- 
|    123 | SUCCESS | 2021-10-10 |    1 | 2021-10-10 |      0 |
|    123 | SUCCESS | 2021-10-11 |    2 | 2021-10-11 |      1 |
|    123 | SUCCESS | 2021-10-28 |    3 | 2021-10-28 |      0 |
|    123 | FAIL    | 2021-10-29 |    3 | 2021-10-28 |      1 |
|    123 | SUCCESS | 2021-11-05 |    4 | 2021-11-05 |      1 |
|    123 | SUCCESS | 2021-11-10 |    5 | 2021-11-10 |      1 |
 -------- --------- ------------ ------ ------------ -------- 

and with the data adjusted to match your expected result, plus a new user introduced, the result is this:

 -------- --------- ------------ ------ ------------ -------- 
| userID | status  | xdate      | grp  | lastdate   | isNear |
 -------- --------- ------------ ------ ------------ -------- 
|    123 | SUCCESS | 2021-10-10 |    1 | 2021-10-10 |      0 |
|    123 | SUCCESS | 2021-10-11 |    2 | 2021-10-11 |      1 |
|    123 | SUCCESS | 2021-10-28 |    3 | 2021-10-28 |      0 |
|    123 | FAIL    | 2021-10-29 |    3 | 2021-10-28 |      1 |
|    123 | SUCCESS | 2021-11-05 |    4 | 2021-11-05 |      1 |
|    123 | SUCCESS | 2021-11-28 |    5 | 2021-11-28 |      0 |
|    323 | SUCCESS | 2021-10-10 |    1 | 2021-10-10 |      0 |
|    323 | SUCCESS | 2021-10-11 |    2 | 2021-10-11 |      1 |
|    323 | SUCCESS | 2021-10-28 |    3 | 2021-10-28 |      0 |
|    323 | FAIL    | 2021-10-29 |    3 | 2021-10-28 |      1 |
|    323 | SUCCESS | 2021-11-05 |    4 | 2021-11-05 |      1 |
|    323 | SUCCESS | 2021-11-28 |    5 | 2021-11-28 |      0 |
 -------- --------- ------------ ------ ------------ -------- 
  • Related