Home > database >  Find longest rolling sum under/equal a threshold for each group
Find longest rolling sum under/equal a threshold for each group

Time:12-17

I am trying to find the most efficient way to identify the longest rolling sum of a vector where the value is below a specific threshold. For instance, if we had 1:10 and a threshold of 6, then 3 is our longest rolling sum.

I have got the below code and function I've made for this, but is obviously extremely slow. Wondering if there's a more efficient or already implemented algorithm for this that identifies the longest runs with sums under the threshold.

library(dplyr)
library(zoo)

set.seed(1)

df <- data.frame(
  g = rep(letters[1:10], each = 100),
  x = runif(1000)
)

longest_rollsum <- function(x, threshold) {
  for (i in 1:length(x)) {
    rs <- rollsum(x, i, na.pad = TRUE, align = "right")
    if (!any(rs <= threshold, na.rm = TRUE)) {
      return(i - 1)
    }
  }
  return(i)
}

df %>%
  group_by(g) %>%
  summarize(longest = longest_rollsum(x, 2))
#> # A tibble: 10 × 2
#>    g     longest
#>    <chr>   <dbl>
#>  1 a           7
#>  2 b           6
#>  3 c           9
#>  4 d           6
#>  5 e           6
#>  6 f           8
#>  7 g           6
#>  8 h           7
#>  9 i          11
#> 10 j           9

CodePudding user response:

One way to improve performance would be to use the roll_sum() function from the RcppRoll package instead of rollsum() from the zoo package:

library(tidyverse)
library(zoo)
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric

set.seed(1)

df <- data.frame(
  g = rep(letters[1:10], each = 100),
  x = runif(1000)
)

longest_rollsum <- function(x, threshold) {
  for (i in 1:length(x)) {
    rs <- zoo::rollsum(x, i, na.pad = TRUE, align = "right")
    if (!any(rs <= threshold, na.rm = TRUE)) {
      return(i - 1)
    }
  }
  return(i)
}

longest_rollsum_rcpp <- function(x, threshold) {
  for (i in 1:length(x)) {
    rs <- RcppRoll::roll_sum(x, i, align = "right", fill=NA)
    if (!any(rs <= threshold, na.rm = TRUE)) {
      return(i - 1)
    }
  }
  return(i)
}

rollsum <- function(df){
  df %>%
    group_by(g) %>%
    summarize(longest = longest_rollsum(x, 2))
}

rollsum_rcpp <- function(df){
  df %>%
    group_by(g) %>%
    summarize(longest = longest_rollsum_rcpp(x, 2))
}

library(microbenchmark)
res <- microbenchmark(rollsum(df), rollsum_rcpp(df))
autoplot(res)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

all_equal(rollsum(df), rollsum_rcpp(df))
#> [1] TRUE

Created on 2021-12-17 by the reprex package (v2.0.1)

CodePudding user response:

You can Write an R function that is faster:

longest_rollsum_R <- function(x, threshold) {
  R_rollsum <- function(k){
    for(i in seq(length(x) - k))
      if(sum(x[i:(i k)]) < threshold) return(FALSE)
    TRUE
  }
  for (i in 1:length(x)) if(R_rollsum(i)) return(i)    
}

Now comparing this to the above,

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
      rollsum(df) 23.440017 25.407785 29.007619 27.906883 31.014434 45.516888   100
 rollsum_rcpp(df)  4.046400  4.499688  5.253406  4.734718  5.596438 14.079618   100
   rollsum_R1(df)  3.798058  4.194639  5.100568  4.710468  5.280267 14.749520   100

The change doesn't seem big. But it is quite huge when we change the threshold:

Unit: milliseconds
                 expr        min         lq       mean    median         uq       max neval
      rollsum(df, 20) 111.336885 130.055676 142.683567 138.11347 147.231358 306.06438   100
 rollsum_rcpp(df, 20)  11.640328  13.170309  15.166030  14.03039  16.060333  31.23998   100
   rollsum_R1(df, 20)   5.993384   7.128607   8.125868   7.54488   8.140206  19.86842   100

You could also write your own code in Rcpp which accounts for short circuiting, that does the work faster than the two approaches so far given.

Save the following in your working directory as longest_rollsum.cpp:

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]

int longest_rollsum_C(NumericVector x, double threshold){
  auto rollsum_t = [&x, threshold](int k) {
    for (int i = 0; i< x.length() - k; i  ) 
      if(sum(x[seq(i, i k)]) < threshold) return false;
    return true;
   };
  for (int i = 0; i<x.length(); i  ) if(rollsum_t(i)) return i;
  return 0;
}

In R: Source the above file

Rcpp::sourceCpp("longest_rollsum.cpp")
rollsum_R <- function(df){
  df %>% 
     group_by(g) %>%
    summarise(longest = longest_rollsum_C(x, 2))
}

microbenchmark::microbenchmark(rollsum(df), rollsum_rcpp(df), rollsum_R(df))
Unit: milliseconds
             expr       min        lq      mean    median        uq      max neval
      rollsum(df) 24.052665 25.018864 26.985276 25.453187 27.479305 37.49629   100
 rollsum_rcpp(df)  4.077397  4.352724  4.755942  4.572804  4.902468 13.10230   100
    rollsum_R(df)  2.271907  2.529000  2.871154  2.714801  2.955849 10.62107   100
  • Related