I have a dataset that looks like the below:
> head(mydata)
id value1 value2
1: 1 200001 300001
2: 2 200002 300002
3: 3 200003 300003
4: 4 200004 300004
5: 5 200005 300005
6: 6 200006 300006
value1
and value2
represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).
After trying different options that were slower, I am currently using map2
from the purrr
package in combination with approx
from base R. I create the new variable using assignment by reference from the data.table
package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).
Note that I also use get()
to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr
package to speed up map2
by making the code parallel, but the speed gains were not material.
Below is reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!
mydata <- data.table(id = 1:1000, value1= 2001:3000, value2= 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
CodePudding user response:
Just use the formula for linear interpolation to vectorize over the whole data.table
.
mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
system.time({
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
})
#> user system elapsed
#> 41.50 0.53 42.05
system.time({
mydata[, interpolated_value2 := get(floor_value) m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0 0 0
identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE
It also works just as fast when m
is a vector.
m <- sample(12, 1e6 1, 1)
system.time({
mydata[, interpolated_value2 := get(floor_value) m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0.01 0.00 0.02
CodePudding user response:
Is it faster with map2_dbl
? map2
will return a list, but I think you want a numeric result instead.