Home > database >  Look at running median 30 days back from current date
Look at running median 30 days back from current date

Time:11-25

Background: I have a table like the below where I'm trying to get the median value based on a rolling 30 day lookback from each created_date partitioned by each city.

The problem is that I have missing dates for some cities so bound by rows preceding won't work in this situation.

Note: I do have a date-spine table that I can leverage, but not sure if it's helpful here

CREATED_DATE CITY VALUE
2018-08-30 Charlotte 374900
2018-08-31 Charlotte 272000
2018-09-10 Charlotte 1
2018-09-24 Charlotte 365000
2018-10-04 Charlotte 342000
2018-10-07 Charlotte 460000
2018-10-08 Charlotte 91000
2018-10-15 Charlotte 342000
2018-10-18 Charlotte 155000
2018-10-19 Charlotte 222000
... ... ...

Expected Output:

CREATED_DATE CITY VALUE MOVING_MEDIAN_30_DAY
2018-08-30 Charlotte 374900 374900
2018-08-31 Charlotte 272000 323450
2018-09-10 Charlotte 1 272000
2018-09-24 Charlotte 365000 318500
2018-10-04 Charlotte 342000 342000
2018-10-07 Charlotte 460000 353500
2018-10-08 Charlotte 91000 342000
2018-10-15 Charlotte 342000 342000
2018-10-18 Charlotte 155000 342000
2018-10-19 Charlotte 222000 342000
... ... ... ...

Question: How can I get to my expected output using sql / snowflake?

Thanks in advance!!

CodePudding user response:

Since Snowflake's MEDIAN window function doesn't support sliding frames, we must take another approach. We can use Snowflake's MEDIAN aggregate function, along with a self-join, to simulate a window function.

 SELECT original.created_date
      , original.city
      , original.value
      , MEDIAN(window.value) AS rolling_30_day_median
   FROM cities AS original
   LEFT JOIN cities AS window
     ON original.city = window.city
    AND DATEDIFF(days, original.created_date, window.created_date) BETWEEN -30 AND 0
  GROUP BY 1, 2, 3
  ORDER BY 1
;

This produces the desired output.

CREATED_DATE CITY VALUE ROLLING_30_DAY_MEDIAN
2018-08-30 Charlotte 374,900 374,900
2018-08-31 Charlotte 272,000 323,450
2018-09-10 Charlotte 1 272,000
2018-09-24 Charlotte 365,000 318,500
2018-10-04 Charlotte 342,000 342,000
2018-10-07 Charlotte 460,000 353,500
2018-10-08 Charlotte 91,000 342,000
2018-10-15 Charlotte 342,000 342,000
2018-10-18 Charlotte 155,000 342,000
2018-10-19 Charlotte 222,000 342,000

CodePudding user response:

I tried using the MEDIAN but looks like it is not getting the value you want, what you may want to do is use LAG function and create a moving average variable. THis is non the assumption of that you have data for all the days and the 1st 29 days you may get 0 value.

  select CREATED_DATE,CITY, VALUE, --month(CREATED_DATE),
    median (value) over (Partition by city, month(CREATED_DATE)) as med,
 (value   lag(value,1) over (Partition by city order by CREATED_DATE)   
lag(value,2) over (Partition by city order by CREATED_DATE)   
lag(value,3)over (Partition by city order by CREATED_DATE)  
--....
lag(value,30) over (Partition by city order by CREATED_DATE) ) / 30  as movingaverage 
--over (order by CREATED_DATE)
 from   test_dt;



        create or replace table test_dt as 
    (select to_date('2018-08-30') as CREATED_DATE,  'Charlotte' as CITY, 374900 as VALUE UNION ALL
    select'2018-08-31','Charlotte', 272000 from dual UNION ALL
    select'2018-09-10','Charlotte', 1 from dual UNION ALL
    select'2018-09-24','Charlotte', 365000 from dual UNION ALL
    select'2018-10-04','Charlotte', 342000 from dual UNION ALL
    select'2018-10-07','460000',0 from dual UNION ALL
    select'2018-10-08','Charlotte', 91000 from dual UNION ALL
    select'2018-10-15','Charlotte', 342000 from dual UNION ALL
    select'2018-10-18','Charlotte', 155000 from dual UNION ALL
    select'2018-10-19','Charlotte', 222000
    );
    

    select CREATED_DATE,CITY, VALUE, --month(CREATED_DATE),
    median (value) over (Partition by city, month(CREATED_DATE)) as med from   test_dt;

CodePudding user response:

This is a great problem to solve with a UDTF:

with data as (
    select $1 n, '2021-01-01'::date   7*seq8() d, 'Rivendell' city
    from values(1),(2),(3),(33),(10),(-5),(13),(20),(40),(80),(1),(4),(-5)
)

select a.*, median
from data a
    , table(running_median_30_days(n::float, d) over(partition by city order by d));

enter image description here enter image description here

The definition of the UDTF is more complicated that the self-join solution by TJ, but it helps encapsulate the desired behavior behind a named function:

create or replace function running_median_30_days(N float, D date)
returns table(MEDIAN float)
language javascript
as $$
{
    processRow: function f(row, rowWriter, context){
        // Store history.
        this.pointarr.push({n:row.N, d:row.D});
        // Discard >30 days.
        this.pointarr = this.pointarr.filter(x=> this.dateDiffInDays(x.d, row.D)<30);
        // Return median.
        rowWriter.writeRow({MEDIAN: this.median(this.pointarr.map(x=> x.n))});
    }        
    , initialize: function(argumentInfo, context) {
        this.pointarr = [];
        this.median = function(values){
            // https://stackoverflow.com/a/45309555/132438
            values.sort(function(a,b){
                return a-b;
            });
            var half = Math.floor(values.length / 2);
            if (values.length % 2) return values[half];
            return (values[half - 1]   values[half]) / 2.0;
        }
        this.dateDiffInDays = function (a, b) {
            // https://stackoverflow.com/a/15289883/132438
            var utc1 = Date.UTC(a.getFullYear(), a.getMonth(), a.getDate());
            var utc2 = Date.UTC(b.getFullYear(), b.getMonth(), b.getDate());
            return Math.floor((utc2 - utc1) / (1000 * 60 * 60 * 24));
        }
    }
}
$$;

What complicated the solution further:

  • Sorting numbers and calculating medians in JS is not straightforward, so we had to define functions to do so.
  • Finding days between dates in JS is not straightforward, so we had to define a function for that too.

As advantages:

  • A UDTF should scale better than a self join, since it processes each row only once.
  • The logic for storing only 30 days of history and calculating the median lives within it.

Read more:

  • Related