Home > Software engineering >  Histogram of orders by range of dates
Histogram of orders by range of dates

Time:12-16

I'm trying to create a histogram based on interval of dates and total number of orders but im having a hard time binning it through SQL.

A simplified table can be seen below

customer_id Date count_orders
1 01-01-2020 5
1 01-13-2020 26
1 02-06-2020 11
2 01-17-2020 9
3 02-04-2020 13
3 03-29-2020 24
4 04-05-2020 1
5 02-23-2020 10
6 03-15-2020 7
6 04-18-2020 32
... ... ...

and im thinking of binning it into 20 day intervals but the only thing I can think about is do a

SUM(CASE WHEN Date BETWEEN <interval1_startdate> AND <interval1_enddate> ...)

method per interval which if used into the actual data (which contains millions of row) is quite exhausting. So i need help in automating the binning part.

Desired output would either be

1)

interval total_count
01-01-2020 - 01-20-2020 31
01-21-2020 - 02-10-2020 24
02-10-2020 - 03-01-2020 10
... ...

or 2)

start end total_count
01-01-2020 01-20-2020 31
01-21-2020 02-10-2020 24
02-10-2020 03-01-2020 10
... ... ...

Do you have any ideas?

CodePudding user response:

You can group by the (current date - minimum date)/20. For preso something like this:

WITH dataset (customer_id, Date, count_orders) AS (
    VALUES (1, date_parse('01-01-2020', '%m-%d-%Y'), 5),
        (1, date_parse('01-13-2020', '%m-%d-%Y'), 26),
        (1, date_parse('02-06-2020', '%m-%d-%Y'), 11),
        (2, date_parse('01-17-2020', '%m-%d-%Y'), 9),
        (3, date_parse('02-04-2020', '%m-%d-%Y'), 13),
        (3, date_parse('03-29-2020', '%m-%d-%Y'), 24),
        (4, date_parse('04-05-2020', '%m-%d-%Y'), 1),
        (5, date_parse('02-23-2020', '%m-%d-%Y'), 10),
        (6, date_parse('03-15-2020', '%m-%d-%Y'), 7),
        (6, date_parse('04-18-2020', '%m-%d-%Y'), 32)
)

SELECT date_add('day', 20 * grp, min(min_date)) interval_end,
    date_add('day', 20 * (grp   1) - 1, min(min_date)) interval_end,
    sum(count_orders) total_count
FROM (
        SELECT *,
            date_diff('day', min(date) over (), date) / 20 as grp,
            min(date) over () min_date
        FROM dataset
    )
group by grp
order by 1

Output:

interval_end interval_end total_count
2020-01-01 00:00:00.000 2020-01-20 00:00:00.000 40
2020-01-21 00:00:00.000 2020-02-09 00:00:00.000 24
2020-02-10 00:00:00.000 2020-02-29 00:00:00.000 10
2020-03-01 00:00:00.000 2020-03-20 00:00:00.000 7
2020-03-21 00:00:00.000 2020-04-09 00:00:00.000 25
2020-04-10 00:00:00.000 2020-04-29 00:00:00.000 32

CodePudding user response:

This should work

DECLARE @StartDateTime DATETIME = '2020-01-01';
DECLARE @EndDateTime DATETIME = '2021-01-01';

WITH DateRange(DateData) AS 
(
    SELECT @StartDateTime as Date
    UNION ALL
    SELECT DATEADD(d,20,DateData)
    FROM DateRange 
    WHERE DateData < @EndDateTime
),
DateRanges as
(
    SELECT IntervalStart.DateData as "start", IntervalEnd.DateData AS "end"
    FROM DateRange as IntervalStart
    inner join  DateRange as IntervalEnd
    ON IntervalEnd.DateData = DATEADD(day, 20, IntervalStart.DateData)
)
(Select "start", "end",
    (SELECT COALESCE(SUM(Table1."count_orders"), 0) FROM Table1 
        WHERE Table1."Date" BETWEEN "start" AND "end") as total_count
    from DateRanges)

Basically you generate all the 20-day intervals you want, then for each interval you get the sum of all the corresponding total_count.

You can try it here: dbfiddle

CodePudding user response:

You can get the intervals using CTE and then get the total using cross apply.

Drop table Tbl
Create Table Tbl (customer_id Int, [date] Date, count_orders Int)

Insert Into Tbl (customer_id, [date], count_orders)
Values (1,'2020-01-01', 5),
       (1,'2020-01-13',26),
       (1,'2020-02-06',11),
       (2,'2020-01-17',9),
       (3,'2020-02-04',13),
       (3,'2020-03-29',24),
       (4,'2020-04-05',1),
       (5,'2020-02-23',10),
       (6,'2020-03-15',7),
       (6,'2020-04-18',32)


;With A As (
Select Min([date]) As start, DateAdd(dd,19,Min([date])) As [end], Max([date]) As [max]
From Tbl
Union All
Select DateAdd(dd,1,[end]) As start, DateAdd(dd,20,[end]) As [end], [max]
From A
Where [end]<[max])
Select A.[start], A.[end], T.total_count
From A Cross Apply (Select SUM(count_orders) As total_count 
                    From Tbl Where [date] between A.[start] And A.[end]) As T

Result:

start      end        total_count
---------- ---------- -----------
2020-01-01 2020-01-20 40
2020-01-21 2020-02-09 24
2020-02-10 2020-02-29 10
2020-03-01 2020-03-20 7
2020-03-21 2020-04-09 25
2020-04-10 2020-04-29 32
  • Related