I'm trying to create a histogram based on interval of dates and total number of orders but im having a hard time binning it through SQL.
A simplified table can be seen below
customer_id | Date | count_orders |
---|---|---|
1 | 01-01-2020 | 5 |
1 | 01-13-2020 | 26 |
1 | 02-06-2020 | 11 |
2 | 01-17-2020 | 9 |
3 | 02-04-2020 | 13 |
3 | 03-29-2020 | 24 |
4 | 04-05-2020 | 1 |
5 | 02-23-2020 | 10 |
6 | 03-15-2020 | 7 |
6 | 04-18-2020 | 32 |
... | ... | ... |
and im thinking of binning it into 20 day intervals but the only thing I can think about is do a
SUM(CASE WHEN Date BETWEEN <interval1_startdate> AND <interval1_enddate> ...)
method per interval which if used into the actual data (which contains millions of row) is quite exhausting. So i need help in automating the binning part.
Desired output would either be
1)
interval | total_count |
---|---|
01-01-2020 - 01-20-2020 | 31 |
01-21-2020 - 02-10-2020 | 24 |
02-10-2020 - 03-01-2020 | 10 |
... | ... |
or 2)
start | end | total_count |
---|---|---|
01-01-2020 | 01-20-2020 | 31 |
01-21-2020 | 02-10-2020 | 24 |
02-10-2020 | 03-01-2020 | 10 |
... | ... | ... |
Do you have any ideas?
CodePudding user response:
You can group by the (current date - minimum date)/20. For preso something like this:
WITH dataset (customer_id, Date, count_orders) AS (
VALUES (1, date_parse('01-01-2020', '%m-%d-%Y'), 5),
(1, date_parse('01-13-2020', '%m-%d-%Y'), 26),
(1, date_parse('02-06-2020', '%m-%d-%Y'), 11),
(2, date_parse('01-17-2020', '%m-%d-%Y'), 9),
(3, date_parse('02-04-2020', '%m-%d-%Y'), 13),
(3, date_parse('03-29-2020', '%m-%d-%Y'), 24),
(4, date_parse('04-05-2020', '%m-%d-%Y'), 1),
(5, date_parse('02-23-2020', '%m-%d-%Y'), 10),
(6, date_parse('03-15-2020', '%m-%d-%Y'), 7),
(6, date_parse('04-18-2020', '%m-%d-%Y'), 32)
)
SELECT date_add('day', 20 * grp, min(min_date)) interval_end,
date_add('day', 20 * (grp 1) - 1, min(min_date)) interval_end,
sum(count_orders) total_count
FROM (
SELECT *,
date_diff('day', min(date) over (), date) / 20 as grp,
min(date) over () min_date
FROM dataset
)
group by grp
order by 1
Output:
interval_end | interval_end | total_count |
---|---|---|
2020-01-01 00:00:00.000 | 2020-01-20 00:00:00.000 | 40 |
2020-01-21 00:00:00.000 | 2020-02-09 00:00:00.000 | 24 |
2020-02-10 00:00:00.000 | 2020-02-29 00:00:00.000 | 10 |
2020-03-01 00:00:00.000 | 2020-03-20 00:00:00.000 | 7 |
2020-03-21 00:00:00.000 | 2020-04-09 00:00:00.000 | 25 |
2020-04-10 00:00:00.000 | 2020-04-29 00:00:00.000 | 32 |
CodePudding user response:
This should work
DECLARE @StartDateTime DATETIME = '2020-01-01';
DECLARE @EndDateTime DATETIME = '2021-01-01';
WITH DateRange(DateData) AS
(
SELECT @StartDateTime as Date
UNION ALL
SELECT DATEADD(d,20,DateData)
FROM DateRange
WHERE DateData < @EndDateTime
),
DateRanges as
(
SELECT IntervalStart.DateData as "start", IntervalEnd.DateData AS "end"
FROM DateRange as IntervalStart
inner join DateRange as IntervalEnd
ON IntervalEnd.DateData = DATEADD(day, 20, IntervalStart.DateData)
)
(Select "start", "end",
(SELECT COALESCE(SUM(Table1."count_orders"), 0) FROM Table1
WHERE Table1."Date" BETWEEN "start" AND "end") as total_count
from DateRanges)
Basically you generate all the 20-day intervals you want, then for each interval you get the sum of all the corresponding total_count.
You can try it here: dbfiddle
CodePudding user response:
You can get the intervals using CTE and then get the total using cross apply
.
Drop table Tbl
Create Table Tbl (customer_id Int, [date] Date, count_orders Int)
Insert Into Tbl (customer_id, [date], count_orders)
Values (1,'2020-01-01', 5),
(1,'2020-01-13',26),
(1,'2020-02-06',11),
(2,'2020-01-17',9),
(3,'2020-02-04',13),
(3,'2020-03-29',24),
(4,'2020-04-05',1),
(5,'2020-02-23',10),
(6,'2020-03-15',7),
(6,'2020-04-18',32)
;With A As (
Select Min([date]) As start, DateAdd(dd,19,Min([date])) As [end], Max([date]) As [max]
From Tbl
Union All
Select DateAdd(dd,1,[end]) As start, DateAdd(dd,20,[end]) As [end], [max]
From A
Where [end]<[max])
Select A.[start], A.[end], T.total_count
From A Cross Apply (Select SUM(count_orders) As total_count
From Tbl Where [date] between A.[start] And A.[end]) As T
Result:
start end total_count
---------- ---------- -----------
2020-01-01 2020-01-20 40
2020-01-21 2020-02-09 24
2020-02-10 2020-02-29 10
2020-03-01 2020-03-20 7
2020-03-21 2020-04-09 25
2020-04-10 2020-04-29 32