I have a billion row dataset which is constantly building with more repeat data on customers.
ID creation_date report_date status
001 2021-01-20T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-02T00:23:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
003 2021-02-01T00:22:06Z 2021-02-19T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-02-20T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-20T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-21T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-21T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-02-30T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-02-30T00:22:06Z EXPIRED
001 2021-01-20T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
002 2021-01-30T00:22:06Z 2021-03-01T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-01T00:22:06Z ACTIVE
001 2021-01-20T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
002 2021-01-30T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
003 2021-02-01T00:22:06Z 2021-03-22T00:22:06Z EXPIRED
Each report_date indicates a date where all records were updated to their current status. Like a pulse check.
And all I would like is the last status of which the user was during the week after a month from creation date (week 5).
As an example: ID = 001.
Here we see their creation date is 2021-01-20
, meaning one month from this date is 2021-02-20
. I would like to know:
- What was the final status of this user during the report dates between
2021-02-20
and2021-02-27
?
You can see in the above data that Active remained active in all of the reports between 2021-02-20
and 2021-02-27
(that are listed.)
To make things simple, we only want to know the LAST change of status in this time frame. Notice in ID=003, they swapped to ACTIVE
on 2021-02-22', so though they were
EXPIRED` the day before, they switched back to active within the boundary.
Anything after a week after the month (anything after 5 weeks) is irrelevant.
You also may notice that 1 month from 2021-01-30
is 2021-02-30
which doesnt make sense. In these cases, use the final date of the month, or 2021-02-28
.
Final output:
ID week_5_status
001 ACTIVE
002 EXPIRED
003 ACTIVE
CodePudding user response:
First, convert the text values (presumably) to valid datetime values. Then, filter the rows such that report_datetime is fewer than 6 weeks after creation_datetime. Take the max of that filtered list, then join back to the original data to get the status for the row with the max value.
CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
INSERT INTO t (id,creation_date,report_date,status) VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');
WITH dat
AS
(
SELECT id
, CAST(creation_date AS datetime) AS creation_datetime
, CAST(REPLACE(report_date,'02-30','02-28') AS datetime) AS report_datetime
, status
FROM t
),
dat2
AS
(
SELECT id
,MAX(report_datetime) AS max_report_datetime
FROM dat
WHERE DATEDIFF(week,creation_datetime,report_datetime) < 6
GROUP BY id
)
SELECT dat.*
FROM dat
INNER JOIN dat2
ON dat.id = dat2.id
AND dat.report_datetime = dat2.max_report_datetime;
CodePudding user response:
QUALIFY and ROW_NUMBER seem to be what you want to select the "last state inside the window"
So with a CTE for data
, which I changed a couple of invalid report_date
to be real dates. Perhaps in the wrong direction, but it doesn't impact the SQL.
WITH data(id,creation_date,report_date,status)AS (
SELECT column1
,to_date(column2, 'YYYY-MM-DDThh:mi:ss')
,to_date(column3, 'YYYY-MM-DDThh:mi:ss')
,column4
FROM VALUES
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-28T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-28T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED')
)
The main SQL becomes:
SELECT d.id
,d.creation_date
,d.report_date
,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;
gives:
ID | CREATION_DATE | REPORT_DATE | STATUS |
---|---|---|---|
1 | 2021-01-20 | 2021-02-21 | ACTIVE |
2 | 2021-01-30 | 2021-03-01 | EXPIRED |
3 | 2021-02-01 | 2021-03-01 | ACTIVE |
Or you literally want only the two columns:
SELECT d.id
,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;
gives:
ID | STATUS |
---|---|
1 | ACTIVE |
2 | EXPIRED |
3 | ACTIVE |