Home > Software design >  Missing dates for specific identifiers without adding extra dates when this identifier is no longer
Missing dates for specific identifiers without adding extra dates when this identifier is no longer

Time:05-06

To put the problem in words, I have a massive table which includes subscribers and data for every day. If the subscriber no longer exists, then they will have no more records i.e. SUB123 no longer exists from the 28/10/2021 then this subscriber will have records up every day until 27/10/2021. The problem at hand is that some subscribers have missing dates and this could perhaps be as it is a weekend or other problems. I want to fill these records with null values so that they could be on record.

The current problem:

Subscriber Date Rev
sub123 25/10/2021 256
sub456 25/10/2021 282
sub123 26/10/2021 652
sub123 27/10/2021 396
sub456 28/10/2021 132
sub456 29/10/2021 484
sub456 01/11/2021 96
sub456 02/11/2021 45

The desired solution:

Subscriber Date Rev
sub123 25/10/2021 256
sub456 25/10/2021 282
sub123 26/10/2021 652
sub456 26/10/2021 NULL
sub123 27/10/2021 396
sub456 27/10/2021 NULL
sub456 28/10/2021 132
sub456 29/10/2021 484
sub456 30/10/2021 NULL
sub456 31/10/2021 NULL
sub456 01/11/2021 96
sub456 02/11/2021 45

My current attempt:

WITH all_dates as (
SELECT
     CAST(date_column AS DATE) date_column, b.subscriber, b.date
FROM
 (VALUES
     (SEQUENCE(
      min(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date),
      max(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date),
      INTERVAL '1' DAY)
     )
 ) AS t1(date_array)
CROSS JOIN
 UNNEST(date_array) AS t2(date_column) 
LEFT JOIN MAINTABLE b 
on t2.date_column = b.date
), 
customer_dates as (
SELECT distinct a.subscriber, a.date, b.date_column
from MAINTABLE a
left join all_dates b
on a.date = b.date_column
    )
    SELECT *
    from customer_dates a

This code doesn't work but its an attempt to what I am trying to accomplish if I were to use the following code that is attached below it will generate dates for all subscribers from the initial date to the end date which is not what we want hence why the above code is what was attempted.

WITH all_dates as (
SELECT
     CAST(date_column AS DATE) date_column, b.subscriber, b.date
FROM
 (VALUES
     (SEQUENCE(
      date('2021-10-25'),
      date('2022-04-30'),
      INTERVAL '1' DAY)
     )
 ) AS t1(date_array)
CROSS JOIN
 UNNEST(date_array) AS t2(date_column) 
LEFT JOIN MAINTABLE b 
on t2.date_column = b.date
), 
customer_dates as (
SELECT distinct a.subscriber, a.date, b.date_column
from MAINTABLE a
left join all_dates b
on a.date = b.date_column
    )
    SELECT *
    from customer_dates a

CodePudding user response:

You can use lag function to generate missing diapasons to flatten with unnest and handle Rev additionally:

-- sample data
WITH dataset (Subscriber, Date, Rev) AS (
    VALUES ('sub123',   date_parse('25-10-2021', '%d-%m-%Y'),   256),
    ('sub456',  date_parse('25-10-2021', '%d-%m-%Y'),   282),
    ('sub123',  date_parse('26-10-2021', '%d-%m-%Y'),   652),
    ('sub123',  date_parse('27-10-2021', '%d-%m-%Y'),   396),
    ('sub456',  date_parse('28-10-2021', '%d-%m-%Y'),   132),
    ('sub456',  date_parse('29-10-2021', '%d-%m-%Y'),   484),
    ('sub456',  date_parse('01-11-2021', '%d-%m-%Y'),   96),
    ('sub456',  date_parse('02-11-2021', '%d-%m-%Y'),   45)
) 

-- query
select subscriber, lifted_date as date, if(date = lifted_date, rev, NULL) rev
from
(
    select Subscriber, 
        Rev,
        cast(date as date) date, 
        lag(cast(date as date)) over(partition by Subscriber order by date) prev_date
    from dataset
)
cross join unnest(
        array_except(sequence(coalesce(prev_date, date), date, interval '1' day), array[prev_date])
    ) as t(lifted_date)
order by subscriber, date

Output:

subscriber date rev
sub123 2021-10-25 00:00:00.000 256
sub123 2021-10-26 00:00:00.000 652
sub123 2021-10-27 00:00:00.000 396
sub456 2021-10-25 00:00:00.000 282
sub456 2021-10-26 00:00:00.000
sub456 2021-10-27 00:00:00.000
sub456 2021-10-28 00:00:00.000 132
sub456 2021-10-29 00:00:00.000 484
sub456 2021-10-30 00:00:00.000
sub456 2021-10-31 00:00:00.000
sub456 2021-11-01 00:00:00.000 96
sub456 2021-11-02 00:00:00.000 45
  • Related