Is using na.approx depending on dynamical date range (timeframe) possible?-CodePudding

does anyone know if it is possible to use the na.approx function to interpolate depending on a varying timeframe ?

Let's suggest we have a df like (the real df has over 5'000 rows):

Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67

I want to conduct an linear interpolation depending on the date range. For example the two NA's at beginning: 1) 14-5 = 9, 2) Counting the days from 2022-05-02 until 2022-05-06 = 5 days, so we divide 3) 9/5 = 1.8. Value for NA at 2022-05-03 is 6.8 and for 2022-05-06 is 8.6.

Second example at 2022-05-09: 1) 67-15 = 52, 2) 2022-05-08 until 2022-05-10 = 3 days, 3) 52/3 = 17.33333. Value for NA at 2022-05-09 is 32.33333 (= 15 17.33333)

Is this possible to conduct it with the na.approx function? If not, how can I approach this?

CodePudding user response：

I can do it with a bit of a slog, but I may need some help with the date diffs. For the first case there's 5 days between 02/05 & 07/05. In the second there's 2 days, not 3 between 08/05 & 10/05. Have I missed something? :)

Code below:

# get data into required shape, and using data.table package
df <- read.table(text="
Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67
", header=T)
library(data.table)
library(zoo)
library(lubridate)
dt <- as.data.table(df)
dt[, Date := lubridate::ymd(gsub(",","",`Date.`))]
setorder(dt, Date)

# first step, fill in to get the starting value
dt[, Value2 := zoo::na.locf0(Value)]

# group together the rows, only really interested in the NA ones, 
# ensuring they are grouped together. rleid makes a group where it finds new values
dt[, Group := rleid(is.na(Value))]
# find the value after the NA
dt[, ValueNext := shift(Value2, n=1, type="lead")]
# find the dates before and after the NA period
dt[, DatePre := shift(Date, n=1, type="lag")]
dt[, DateNext := shift(Date, n=1, type="lead")]
# find the differences in the values & dates
dt[, ValueDiff := ValueNext[.N]-Value2[1], by=Group]
dt[, DateDiff := as.integer(DateNext[.N]-DatePre[1]), by=Group]
# divide through to get the addition
dt[, ValueAdd := ValueDiff/DateDiff]
# by group, use cumulative sum to add to the starting value
dt[, ValueOut := Value2 cumsum(ValueAdd), by=Group]
# we only care about NA groups, so revert back to original value for other
# cases
dt[!is.na(Value), ValueOut := Value]

# check the NA rows
# ! only difference is I get 2 as the date diff for 2022-05-09, not 3
dt[is.na(Value),]

# Final output
dt[, .(Date, Value, ValueOut)]

CodePudding user response：

Using the data frame DF defined reproducibly in the Note at the end, we see that what is asked for, i.e. linear interpolation with respect to Date is what you get (after correcting the calculation in the question) if you apply na.approx to a zoo series:

library(zoo)
z <- read.zoo(DF)
na.approx(z)
## 2022-05-01 2022-05-02 2022-05-03 2022-05-06 2022-05-07 2022-05-08 2022-05-09 
##        6.0        5.0        6.8       12.2       14.0       15.0       41.0 
## 2022-05-10 
##       67.0

or in terms of the original data frame we can use the x= argument of na.approx to specify that interpolation is with respect to Date.

DF$Date <- as.Date(DF$Date)
transform(DF, Value = na.approx(Value, Date, na.rm = FALSE))
##         Date Value
## 1 2022-05-01   6.0
## 2 2022-05-02   5.0
## 3 2022-05-03   6.8
## 4 2022-05-06  12.2
## 5 2022-05-07  14.0
## 6 2022-05-08  15.0
## 7 2022-05-09  41.0
## 8 2022-05-10  67.0

Suggest you review the documentation using ?na.approx from R.

Note

Lines <- "Date, Value
2022-05-01, 6
2022-05-02, 5
2022-05-03, NA
2022-05-06, NA
2022-05-07, 14
2022-05-08, 15
2022-05-09, NA
2022-05-10, 67"
DF <- read.csv(text = Lines, strip.white = TRUE)