Home > Software engineering >  Partitioning and selecting a value in Snowflake based on distance from date
Partitioning and selecting a value in Snowflake based on distance from date

Time:02-15

I have a billion row dataset which is constantly building with more repeat data on customers.

ID   creation_date          report_date             status
001  2021-01-20T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-20T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-21T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-30T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-03-01T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
002  2021-01-30T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED

Each report_date indicates a date where all records were updated to their current status. Like a pulse check.

And all I would like is the last status of which the user was during the week after a month from creation date (week 5).

As an example: ID = 001.

Here we see their creation date is 2021-01-20, meaning one month from this date is 2021-02-20. I would like to know:

  • What was the final status of this user during the report dates between 2021-02-20 and 2021-02-27?

You can see in the above data that Active remained active in all of the reports between 2021-02-20 and 2021-02-27 (that are listed.)

To make things simple, we only want to know the LAST change of status in this time frame. Notice in ID=003, they swapped to ACTIVE on 2021-02-22', so though they were EXPIRED` the day before, they switched back to active within the boundary.

Anything after a week after the month (anything after 5 weeks) is irrelevant.

You also may notice that 1 month from 2021-01-30 is 2021-02-30 which doesnt make sense. In these cases, use the final date of the month, or 2021-02-28.

Final output:

ID    week_5_status
001          ACTIVE   
002         EXPIRED
003          ACTIVE

CodePudding user response:

First, convert the text values (presumably) to valid datetime values. Then, filter the rows such that report_datetime is fewer than 6 weeks after creation_datetime. Take the max of that filtered list, then join back to the original data to get the status for the row with the max value.

CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
INSERT INTO t (id,creation_date,report_date,status) VALUES 
(1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
(1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
(1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
(3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');


WITH dat
AS
(
SELECT id
, CAST(creation_date AS datetime) AS creation_datetime
, CAST(REPLACE(report_date,'02-30','02-28') AS datetime) AS report_datetime
, status
FROM t
),
dat2
AS
(
SELECT id
,MAX(report_datetime) AS max_report_datetime
FROM dat
WHERE DATEDIFF(week,creation_datetime,report_datetime) < 6
GROUP BY id
)
SELECT dat.*
FROM dat
     INNER JOIN dat2 
             ON dat.id = dat2.id 
            AND dat.report_datetime = dat2.max_report_datetime;

dbfiddle.uk

CodePudding user response:

QUALIFY and ROW_NUMBER seem to be what you want to select the "last state inside the window"

So with a CTE for data, which I changed a couple of invalid report_date to be real dates. Perhaps in the wrong direction, but it doesn't impact the SQL.

 WITH data(id,creation_date,report_date,status)AS (
     SELECT column1
        ,to_date(column2, 'YYYY-MM-DDThh:mi:ss')
        ,to_date(column3, 'YYYY-MM-DDThh:mi:ss')
        ,column4 
     FROM VALUES 
    (1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-28T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-28T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-28T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED')
)

The main SQL becomes:

SELECT d.id
    ,d.creation_date
    ,d.report_date
    ,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;

gives:

ID CREATION_DATE REPORT_DATE STATUS
1 2021-01-20 2021-02-21 ACTIVE
2 2021-01-30 2021-03-01 EXPIRED
3 2021-02-01 2021-03-01 ACTIVE

Or you literally want only the two columns:

SELECT d.id
    ,d.status
FROM data AS d
WHERE dateadd(week,5,d.creation_date) >= d.report_date
QUALIFY row_number() over (partition by id order by report_date desc) = 1 ;

gives:

ID STATUS
1 ACTIVE
2 EXPIRED
3 ACTIVE
  • Related