I have a Spark SQL table in Databricks with missing (null) values, of the following form:
date        price  stock  item_code
2020-01-01  17.99  6      01
2020-01-02  17.99  null   01
2020-01-03  17.99  4      01
2020-01-04  17.99  null   01
2020-01-01  11.59  null   02
2020-01-02  11.59  8      02
2020-01-03  11.59  null   02
2020-01-04  11.59  4      02
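In case it helps, here is the sample data as a temp view named t (the column types are my best guess at what Databricks infers):

CREATE OR REPLACE TEMP VIEW t AS
SELECT * FROM VALUES
  (DATE'2020-01-01', 17.99, 6,    '01'),
  (DATE'2020-01-02', 17.99, NULL, '01'),
  (DATE'2020-01-03', 17.99, 4,    '01'),
  (DATE'2020-01-04', 17.99, NULL, '01'),
  (DATE'2020-01-01', 11.59, NULL, '02'),
  (DATE'2020-01-02', 11.59, 8,    '02'),
  (DATE'2020-01-03', 11.59, NULL, '02'),
  (DATE'2020-01-04', 11.59, 4,    '02')
AS t(date, price, stock, item_code);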
For each item_code, I have data across the same range of date values. The stock column is subject to reporting errors and missing values, and sometimes those missing values occur at the start or end of the date range.

Where an item has recorded values both before and after a null, I'd like to interpolate with a simple average of those two values. Where the missing value sits at the start or end of the date range, I'd like to fill in the first or last available stock value for that item_code. My objective table is like this:
date        price  stock  item_code
2020-01-01  17.99  6      01
2020-01-02  17.99  5      01
2020-01-03  17.99  4      01
2020-01-04  17.99  4      01
2020-01-01  11.59  8      02
2020-01-02  11.59  8      02
2020-01-03  11.59  6      02
2020-01-04  11.59  4      02
I think I could use a window function to find the average of the before/after period, but I'm not sure how to build the window. Any advice would be appreciated.
CodePudding user response:
An analytic (window) function should help here. AVG ignores NULLs, so averaging over a frame of the previous, current, and next row gives the mean of both neighbours when the NULL sits between two recorded values, and the single available neighbour when it falls at the start or end of the range:
SELECT
  date,
  price,
  CASE
    WHEN stock IS NULL THEN AVG(stock) OVER (
      PARTITION BY item_code
      ORDER BY date
      ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    )
    ELSE stock
  END AS stock,
  item_code
FROM t
ORDER BY item_code, date
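One caveat, assuming your real data can have wider gaps than the sample: the ±1-row frame only sees the immediately adjacent rows, so two or more consecutive NULLs (or a leading/trailing NULL more than one row from the nearest recorded value) would still come back NULL. A sketch that generalises, using Spark's null-ignoring LAST/FIRST aggregates to carry the nearest recorded value from each side (the prev_stock/next_stock aliases are mine):

SELECT
  date,
  price,
  COALESCE(
    stock,                          -- keep recorded values as-is
    (prev_stock + next_stock) / 2,  -- average when both sides exist
    prev_stock,                     -- trailing gap: carry last value forward
    next_stock                      -- leading gap: carry first value back
  ) AS stock,
  item_code
FROM (
  SELECT
    *,
    LAST(stock, TRUE) OVER (
      PARTITION BY item_code ORDER BY date
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS prev_stock,
    FIRST(stock, TRUE) OVER (
      PARTITION BY item_code ORDER BY date
      ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
    ) AS next_stock
  FROM t
) sub
ORDER BY item_code, date

On the sample data this produces the same objective table, and it also survives longer runs of missing values.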