Home > database >  How to aggregate over date including all prior dates
How to aggregate over date including all prior dates

Time:12-15

I am working with a table in Databricks Delta lake. It gets new records appended every month. The field insert_dt indicates when the records are inserted.

| ID | Mrc | insert_dt  |
|----|-----|------------|
| 1  | 40  | 2022-01-01 |
| 2  | 30  | 2022-01-01 |
| 3  | 50  | 2022-01-01 |
| 4  | 20  | 2022-02-01 |
| 5  | 45  | 2022-02-01 |
| 6  | 55  | 2022-03-01 |

Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is done not just for the records of that date but all records with date prior to that. In this example, there are 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:

| Mrc | insert_dt  |
|-----|------------|
| 40  | 2022-01-01 |
| 37  | 2022-02-01 |
| 40  | 2022-03-01 |

How do I write a query to do that?

CodePudding user response:

I checked the documentation for Delta-lake databricks (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html ) and it looks like TSQL so I think this will work for you, but you may need to tweak slightly.

The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.

--Enter the sample data you gave as a CTE for testing
;with cteSample as ( 
    SELECT * FROM ( VALUES 
        (1, 40, CONVERT(date,'2022-01-01'))
        , ('2', '30', '2022-01-01')
        , ('3', '50', '2022-01-01')
        , ('4', '20', '2022-02-01')
        , ('5', '45', '2022-02-01')
        , ('6', '55', '2022-03-01')
    ) as TabA(ID, Mrc, insert_dt) 
)--Solution begins here, find the total and count for each date 
--because window can only handle a single "last row"
, cteGrouped as (
    SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
    FROM cteSample
    GROUP BY insert_dt
)--Now use the window function to get the totals "up to today"
, cteTotals as ( 
    SELECT insert_dt
        , SUM(MRCSum) OVER (ORDER BY insert_dt RANGE 
            BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
        , SUM(MRCCount) OVER (ORDER BY insert_dt RANGE 
            BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
    FROM    cteGrouped as G
) --Now divide out to get the average to date
SELECT insert_dt, MrcSum/MrcCount as MRCAverage
FROM cteTotals as T

This gives the following output

insert_dt MRCAverage
2022-01-01 40
2022-02-01 37
2022-03-01 40

CodePudding user response:

Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt - the one with the highest id. I only tested this on PostgreSQL 13 so not sure how far does delta-lake support the SQL standard and will it work there or not though.

select mrc, insert_dt from 
(
 select avg(mrc) over (order by insert_dt, id) mrc, insert_dt, 
        row_number() over (partition by insert_dt order by id desc) rn
 from the_table
) t
where rn = 1
order by insert_dt;

DB-fiddle demo

Update If the_table has no id column then use a CTE to add one.

with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from 
(
 select avg(mrc) over (order by insert_dt, id) mrc, insert_dt, 
        row_number() over (partition by insert_dt order by id desc) rn
 from t_id 
) t
where rn = 1
order by insert_dt;
  • Related