Combination of approx and map2 is surprisingly slow-CodePudding

I have a dataset that looks like the below:

> head(mydata)
   id value1 value2
1:  1 200001 300001
2:  2 200002 300002
3:  3 200003 300003
4:  4 200004 300004
5:  5 200005 300005
6:  6 200006 300006

value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).

After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).

Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.

Below is reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!

mydata <- data.table(id = 1:1000, value1= 2001:3000, value2= 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7

monthly_sum_assured <- function(a, b, m) {
  monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}

mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
                                    ~ monthly_sum_assured(.x, .y, m))]

CodePudding user response：

Just use the formula for linear interpolation to vectorize over the whole data.table.

mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7

monthly_sum_assured <- function(a, b, m) {
  monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}

system.time({
  mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
                                      ~ monthly_sum_assured(.x, .y, m))]
})
#>    user  system elapsed 
#>   41.50    0.53   42.05

system.time({
  mydata[, interpolated_value2 := get(floor_value)   m*(get(ceiling_value) - get(floor_value))/12]
})
#>    user  system elapsed 
#>       0       0       0

identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE

It also works just as fast when m is a vector.

m <- sample(12, 1e6   1, 1)
system.time({
  mydata[, interpolated_value2 := get(floor_value)   m*(get(ceiling_value) - get(floor_value))/12]
})
#>    user  system elapsed 
#>    0.01    0.00    0.02

CodePudding user response：

Is it faster with map2_dbl? map2 will return a list, but I think you want a numeric result instead.