I have the dataset that looks like following:
id date visit_number total_visit registrated
1 1/1/2020 . 1 1 0 --get
1 1/5/2020 . 2 2 1
1 1/9/2020 . 3 3 1
1 1/13/2020 . 4 4 1
1 1/17/2020 . 5 5 1
1 1/21/2020 . 6 6 1
1 1/25/2020 . 7 7 1
1 1/29/2020 . 8 8 1
1 2/2/2020 . 9 9 1 --get
1 2/6/2020 . 1 10 0 --get
1 2/10/2020 . 2 11 0 --get
1 2/14/2020 . 3 12 0 --get
1 2/18/2020 . 4 13 0 --get
1 3/22/2020 . 5 14 1
1 3/26/2020 . 6 15 1
1 4/1/2020 . 7 16 1
1 4/5/2020 . 8 17 1
1 4/9/2020 . 9 18 1
1 4/13/2020 . 10 19 1 --get
1 5/15/2020 . 1 20 0 --get
1 6/20/2020 . 2 21 1 --get
So the pattern here is following: Whenever the user enters the website and buys something the column registration gets 1 and the previous 30 days get 1 on registration as well.
For example, Someone bought something on 4/13/2020, registration gets 1 and all the visits to the website from the same user - his registration column gets 1 starting from 4/13/2020 up until 3/12/2020. And starting from 3/12/2020 his registration column is 0 again. And whenever in the registration column after 1 comes 0 then visit_number starts to count again. Total_visit here counts all the visits to the website from the same user. There are many users, but total_visit and visit_number are partitioned by visitor_id and date.
Now, I want to get only that last date when registration was 1, instead of getting all the previous 30 days.
So again, For example, Someone bought something on 4/13/2020 then I wanna get this row only, not all previous rows within 30 days.
So it should eventually look like this:
date visit_number total_visit registration
1/1/2020 . 1 1 0 --got
2/2/2020 . 9 9 1 --got
2/6/2020 . 1 10 0 --got
2/10/2020 . 2 11 0 --got
2/14/2020 . 3 12 0 --got
2/18/2020 . 4 13 0 --got
4/13/2020 . 10 19 1 --got
5/15/2020 . 1 20 0 --got
6/20/2020 . 2 21 1 --got
remember, registration = 0 here because the user did not buy anything on that day and he did not buy anything for the next 30 days.
I wrote down conditions to help me get my head around the problem:
1. if day difference between the date of current row and the date of previous row is bigger than 30 days, get both rows
For example: 6/20/2020 and 5/15/2020 difference is more than 30 days, therefore we got both
2. if date difference is smaller than 30 days, then check for current row, and if its = 1 then take that only, and remove all 30 days
3. if registration is 0, that easily indicates that we get it automatically.
I tried different things,
select visitor_id, dt1, reg30,
case when date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) > 30 then 'True'
when date_part('day',dt1) - date_part('day',lag(dt1) over(partition by visitor_id order by dt)) < 30 and reg30='1' then 'True'
when reg30 = '0' then 'True'
else 'False'
end
from new_table
order by visitor_id, dt
CodePudding user response:
Disclaimer: I don't use PostgreSQL so there may be a more efficient method, but a general approach that may work is using LEAD() to peek at the registration
value in the next row. Then return rows where either:
- The current
registration
value =1
and the next value =0
(or the next value is null, meaning the last row)... or - current
registration
value =0
SQL:
WITH cte AS (
SELECT *
, LEAD(registrated, 1) OVER(ORDER BY id, "date") AS NextValue
FROM YourTable
)
SELECT *
FROM cte
WHERE registrated = 0
OR ( registrated = 1 AND
( NextValue = 0 OR NextValue IS NULL )
)
Results:
id | date | visit_number | total_visit | registrated | nextvalue -: | :------------------ | -----------: | ----------: | ----------: | --------: 1 | 2020-01-01 00:00:00 | 1 | 1 | 0 | 1 1 | 2020-02-02 00:00:00 | 9 | 9 | 1 | 0 1 | 2020-02-06 00:00:00 | 1 | 10 | 0 | 0 1 | 2020-02-10 00:00:00 | 2 | 11 | 0 | 0 1 | 2020-02-14 00:00:00 | 3 | 12 | 0 | 0 1 | 2020-02-18 00:00:00 | 4 | 13 | 0 | 1 1 | 2020-04-13 00:00:00 | 10 | 19 | 1 | 0 1 | 2020-05-15 00:00:00 | 1 | 20 | 0 | 1 1 | 2020-06-20 00:00:00 | 2 | 21 | 1 | null
db<>fiddle here