How to get only last row if condition is met within 30 days span-CodePudding

I have the dataset that looks like following:

    id  date           visit_number  total_visit  registrated
    1    1/1/2020   .         1     1             0     --get
    1    1/5/2020   .         2     2             1     
    1    1/9/2020   .         3     3             1     
    1    1/13/2020  .         4     4             1     
    1    1/17/2020  .         5     5             1     
    1    1/21/2020  .         6     6             1     
    1    1/25/2020  .         7     7             1     
    1    1/29/2020  .         8     8             1     
    1    2/2/2020   .         9     9             1     --get
    1    2/6/2020   .         1     10            0     --get
    1    2/10/2020  .         2     11            0     --get
    1    2/14/2020  .         3     12            0     --get
    1   2/18/2020   .         4     13            0     --get
    1    3/22/2020  .         5     14            1     
    1    3/26/2020  .         6     15            1     
    1    4/1/2020   .         7     16            1     
    1    4/5/2020   .         8     17            1     
    1    4/9/2020   .         9     18            1     
    1    4/13/2020  .        10     19            1      --get          
    1    5/15/2020  .         1     20            0      --get
    1    6/20/2020  .         2     21            1      --get

So the pattern here is following: Whenever the user enters the website and buys something the column registration gets 1 and the previous 30 days get 1 on registration as well.

For example, Someone bought something on 4/13/2020, registration gets 1 and all the visits to the website from the same user - his registration column gets 1 starting from 4/13/2020 up until 3/12/2020. And starting from 3/12/2020 his registration column is 0 again. And whenever in the registration column after 1 comes 0 then visit_number starts to count again. Total_visit here counts all the visits to the website from the same user. There are many users, but total_visit and visit_number are partitioned by visitor_id and date.

Now, I want to get only that last date when registration was 1, instead of getting all the previous 30 days.

So again, For example, Someone bought something on 4/13/2020 then I wanna get this row only, not all previous rows within 30 days.

So it should eventually look like this:

date                 visit_number total_visit registration

1/1/2020        .           1       1           0       --got   
                        
                        
                        
                        
                        
                        
                        
2/2/2020         .          9       9       1       --got   
2/6/2020         .          1       10          0       --got
2/10/2020        .          2       11          0       --got   
2/14/2020        .          3       12          0       --got   
2/18/2020        .          4       13          0       --got   
                        
                        
                        
                        
                        
4/13/2020        .         10       19          1       --got       
5/15/2020        .          1       20          0       --got   
6/20/2020        .          2       21          1       --got

remember, registration = 0 here because the user did not buy anything on that day and he did not buy anything for the next 30 days.

I wrote down conditions to help me get my head around the problem:

1. if day difference between the date of current row and the date of previous row is bigger than 30 days, get both rows
For example: 6/20/2020 and 5/15/2020 difference is more than 30 days, therefore we got both

2. if date difference is smaller than 30 days, then check for current row, and if its = 1 then take that only, and remove all 30 days

3. if registration is 0, that easily indicates that we get it automatically.

I tried different things,

 select visitor_id, dt1, reg30,
    case when date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) > 30 then 'True' 

             when  date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) < 30 and reg30='1' then 'True'
        
             when reg30 = '0' then 'True'
                                        else 'False' 
    end

from new_table
order by visitor_id, dt

CodePudding user response：

Disclaimer: I don't use PostgreSQL so there may be a more efficient method, but a general approach that may work is using LEAD() to peek at the registration value in the next row. Then return rows where either:

The current registration value = 1 and the next value = 0 (or the next value is null, meaning the last row)... or
current registration value = 0

SQL:

WITH cte AS (
   SELECT *
         , LEAD(registrated, 1) OVER(ORDER BY id, "date") AS NextValue
   FROM  YourTable       
)
SELECT * 
FROM   cte 
WHERE  registrated = 0 
OR   ( registrated = 1 AND
        ( NextValue = 0 OR NextValue IS NULL )
     )

Results:

id | date                | visit_number | total_visit | registrated | nextvalue
-: | :------------------ | -----------: | ----------: | ----------: | --------:
 1 | 2020-01-01 00:00:00 |            1 |           1 |           0 |         1
 1 | 2020-02-02 00:00:00 |            9 |           9 |           1 |         0
 1 | 2020-02-06 00:00:00 |            1 |          10 |           0 |         0
 1 | 2020-02-10 00:00:00 |            2 |          11 |           0 |         0
 1 | 2020-02-14 00:00:00 |            3 |          12 |           0 |         0
 1 | 2020-02-18 00:00:00 |            4 |          13 |           0 |         1
 1 | 2020-04-13 00:00:00 |           10 |          19 |           1 |         0
 1 | 2020-05-15 00:00:00 |            1 |          20 |           0 |         1
 1 | 2020-06-20 00:00:00 |            2 |          21 |           1 |      null

db<>fiddle here