To put the problem in words, I have a massive table which includes subscribers and data for every day. If the subscriber no longer exists, then they will have no more records i.e. SUB123 no longer exists from the 28/10/2021 then this subscriber will have records up every day until 27/10/2021. The problem at hand is that some subscribers have missing dates and this could perhaps be as it is a weekend or other problems. I want to fill these records with null values so that they could be on record.
The current problem:
Subscriber | Date | Rev |
---|---|---|
sub123 | 25/10/2021 | 256 |
sub456 | 25/10/2021 | 282 |
sub123 | 26/10/2021 | 652 |
sub123 | 27/10/2021 | 396 |
sub456 | 28/10/2021 | 132 |
sub456 | 29/10/2021 | 484 |
sub456 | 01/11/2021 | 96 |
sub456 | 02/11/2021 | 45 |
The desired solution:
Subscriber | Date | Rev |
---|---|---|
sub123 | 25/10/2021 | 256 |
sub456 | 25/10/2021 | 282 |
sub123 | 26/10/2021 | 652 |
sub456 | 26/10/2021 | NULL |
sub123 | 27/10/2021 | 396 |
sub456 | 27/10/2021 | NULL |
sub456 | 28/10/2021 | 132 |
sub456 | 29/10/2021 | 484 |
sub456 | 30/10/2021 | NULL |
sub456 | 31/10/2021 | NULL |
sub456 | 01/11/2021 | 96 |
sub456 | 02/11/2021 | 45 |
My current attempt:
WITH all_dates as (
SELECT
CAST(date_column AS DATE) date_column, b.subscriber, b.date
FROM
(VALUES
(SEQUENCE(
min(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date),
max(b.date) OVER (PARTITION BY b.subscriber ORDER BY b.date),
INTERVAL '1' DAY)
)
) AS t1(date_array)
CROSS JOIN
UNNEST(date_array) AS t2(date_column)
LEFT JOIN MAINTABLE b
on t2.date_column = b.date
),
customer_dates as (
SELECT distinct a.subscriber, a.date, b.date_column
from MAINTABLE a
left join all_dates b
on a.date = b.date_column
)
SELECT *
from customer_dates a
This code doesn't work but its an attempt to what I am trying to accomplish if I were to use the following code that is attached below it will generate dates for all subscribers from the initial date to the end date which is not what we want hence why the above code is what was attempted.
WITH all_dates as (
SELECT
CAST(date_column AS DATE) date_column, b.subscriber, b.date
FROM
(VALUES
(SEQUENCE(
date('2021-10-25'),
date('2022-04-30'),
INTERVAL '1' DAY)
)
) AS t1(date_array)
CROSS JOIN
UNNEST(date_array) AS t2(date_column)
LEFT JOIN MAINTABLE b
on t2.date_column = b.date
),
customer_dates as (
SELECT distinct a.subscriber, a.date, b.date_column
from MAINTABLE a
left join all_dates b
on a.date = b.date_column
)
SELECT *
from customer_dates a
CodePudding user response:
You can use lag
function to generate missing diapasons to flatten with unnest
and handle Rev
additionally:
-- sample data
WITH dataset (Subscriber, Date, Rev) AS (
VALUES ('sub123', date_parse('25-10-2021', '%d-%m-%Y'), 256),
('sub456', date_parse('25-10-2021', '%d-%m-%Y'), 282),
('sub123', date_parse('26-10-2021', '%d-%m-%Y'), 652),
('sub123', date_parse('27-10-2021', '%d-%m-%Y'), 396),
('sub456', date_parse('28-10-2021', '%d-%m-%Y'), 132),
('sub456', date_parse('29-10-2021', '%d-%m-%Y'), 484),
('sub456', date_parse('01-11-2021', '%d-%m-%Y'), 96),
('sub456', date_parse('02-11-2021', '%d-%m-%Y'), 45)
)
-- query
select subscriber, lifted_date as date, if(date = lifted_date, rev, NULL) rev
from
(
select Subscriber,
Rev,
cast(date as date) date,
lag(cast(date as date)) over(partition by Subscriber order by date) prev_date
from dataset
)
cross join unnest(
array_except(sequence(coalesce(prev_date, date), date, interval '1' day), array[prev_date])
) as t(lifted_date)
order by subscriber, date
Output:
subscriber | date | rev |
---|---|---|
sub123 | 2021-10-25 00:00:00.000 | 256 |
sub123 | 2021-10-26 00:00:00.000 | 652 |
sub123 | 2021-10-27 00:00:00.000 | 396 |
sub456 | 2021-10-25 00:00:00.000 | 282 |
sub456 | 2021-10-26 00:00:00.000 | |
sub456 | 2021-10-27 00:00:00.000 | |
sub456 | 2021-10-28 00:00:00.000 | 132 |
sub456 | 2021-10-29 00:00:00.000 | 484 |
sub456 | 2021-10-30 00:00:00.000 | |
sub456 | 2021-10-31 00:00:00.000 | |
sub456 | 2021-11-01 00:00:00.000 | 96 |
sub456 | 2021-11-02 00:00:00.000 | 45 |