`dplyr::between` over millions of intervals: how to make it faster


I have a data frame with startValue (in column X1) and endValue (in column X2), and a list of outliers. I need to count how many outliers fall within each [startValue, endValue] interval. Here is a minimal example and my attempt:

library("dplyr")

set.seed(2022)

dat = data.frame(matrix(sort(sample(1:500, 50)), ncol = 2, byrow = T))
dim(dat)
dat$Outliers = NA

outliers = sort(sample(1:500, 30))

for (i in 1:25){
  dat$Outliers[i] = length(which(between(outliers, dat$X1[i], dat$X2[i]) == T))
}

This code works fine, but my real data have millions of rows and this for loop takes a lot of time. Is there a faster way to solve this problem?

CodePudding user response:

a shallow fix at the coding level

You will see an immediate speedup with the following modification:

out <- integer(nrow(dat))
for (i in 1:nrow(dat)) {
  out[i] <- sum(between(outliers, dat$X1[i], dat$X2[i]))
}
dat$Outliers1 <- out
with(dat, identical(Outliers, Outliers1))  ## the result is correct
#[1] TRUE
  1. This avoids modifying a data.frame during the loop, and hence a lot of unnecessary memory copies (which is particularly important if you have a big data.frame); see the short tracemem illustration after this list.

  2. length(which(LogicalVec == TRUE)) is awkward; use sum(LogicalVec).
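To see point 1 in action, here is a small illustration with base R's tracemem (the exact number of copy messages depends on your R version, but you should see the data.frame being duplicated as the loop runs):

df <- data.frame(x = 1:5)
tracemem(df)                   ## print a message whenever `df` is duplicated
for (i in 1:3) df$x[i] <- 0L   ## element-wise modification copies the whole data.frame
untracemem(df)

Writing into a pre-allocated vector and attaching it to the data.frame once, as above, avoids these copies.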

You can also try replacing the loop with mapply:

dat$Outliers2 <- mapply(\(a, b) sum(between(outliers, a, b)), dat$X1, dat$X2)
with(dat, identical(Outliers, Outliers2))  ## the result is correct
#[1] TRUE

Henrik mentioned a data.table solution based on a non-equi join:

library(data.table)
setDT(dat)
d_vals <- as.data.table(outliers)
dat[ , Outliers3 := d_vals[dat, on = .(outliers >= X1, outliers <= X2), .N, by = .EACHI]$N]
with(dat, identical(Outliers, Outliers3))
#[1] TRUE

a deep fix at the algorithm level

If your real data are like what you have simulated, with non-overlapping intervals [startValue, endValue], you can use base::findInterval rather than dplyr::between. The only problem is that findInterval uses half-open intervals [a, b), while between uses closed intervals [a, b]. So if an outlier value happens to equal an endValue, between counts it for that interval, whereas findInterval does not. This does occur in your minimal example.

vec <- c(with(dat, rbind(X1, X2)))
if (is.unsorted(vec, strictly = TRUE)) stop("cannot apply this method!")
x <- findInterval(outliers, vec, rightmost.closed = TRUE)
x <- x[x %% 2 == 1]  ## get rid of the gaps X2[i] ~ X1[i + 1]; just keep X1[i] ~ X2[i]
x <- (x + 1) / 2     ## map the odd interval index back to the row number i
dat$Outliers4 <- tabulate(x, nrow(dat))
with(dat, identical(Outliers, Outliers4))  ## oops, the results are not identical
#[1] FALSE

## you see, the first interval is causing a problem
with(dat, Outliers - Outliers4)
#[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## inspect the first interval
dat[1, 1:2]
#  X1 X2
#1  1  3

## inspect outliers: 3 is an outlier value
## it is not in [1, 3), so not counted by `findInterval`
## but it is in [1, 3], so counted by `between`
outliers
# [1]   1   3  19  32  42  56  87  99 120 218 234 248 276 287 306 316 319 336 351
#[20] 353 361 382 389 400 417 423 425 458 467 480

Well, depending on what you want in the end, you may or may not want to fix this discrepancy. If you do want to fix it, you could do:

i <- which(dat$X2 %in% outliers)
dat$Outliers4[i] <- dat$Outliers4[i] + 1L
with(dat, identical(Outliers, Outliers4))
#[1] TRUE

Hang on, this assumes that your outliers have no duplicated values. If some value appears more than once, you need to add the number of duplicates instead of 1:

i <- which(dat$X2 %in% outliers)
m <- colSums(outer(outliers, dat$X2[i], "=="))  ## how many outliers equal each matched X2
dat$Outliers4[i] <- dat$Outliers4[i] + m
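As a quick sanity check, here is a tiny made-up example (the toy data below are mine, not from your question) where a duplicated outlier value equals an endValue:

toy <- data.frame(X1 = c(1L, 10L), X2 = c(3L, 20L))
toy_out <- c(3L, 3L, 15L)  ## the value 3 appears twice and equals toy$X2[1]
vec <- c(with(toy, rbind(X1, X2)))
x <- findInterval(toy_out, vec, rightmost.closed = TRUE)
x <- (x[x %% 2 == 1] + 1) / 2
cnt <- tabulate(x, nrow(toy))
i <- which(toy$X2 %in% toy_out)
cnt[i] <- cnt[i] + colSums(outer(toy_out, toy$X2[i], "=="))
cnt
#[1] 2 1
## the same counts as sum(between(toy_out, 1, 3)) and sum(between(toy_out, 10, 20))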

benchmark

## a 1000 x 2 data.frame
set.seed(2022)
dat = data.frame(matrix(sort(sample(20000, 2000)), ncol = 2, byrow = T))
outliers = sort(sample(20000, 1000))
## your current method
method0 <- function (dat, outliers) {
  dat$Outliers = NA
  for (i in 1:nrow(dat)) {
    dat$Outliers[i] = length(which(between(outliers, dat$X1[i], dat$X2[i]) == T))
  }
  dat
}
## modified `for`-loop method
method1 <- function (dat, outliers) {
  out <- integer(nrow(dat))
  for (i in 1:nrow(dat)) {
    out[i] <- sum(between(outliers, dat$X1[i], dat$X2[i]))
  }
  dat$Outliers <- out
  dat
}

## `mapply` method
method2 <- function (dat, outliers) {
  dat$Outliers <- mapply(\(a, b) sum(between(outliers, a, b)), dat$X1, dat$X2)
  dat
}
## "data.table" method
method3 <- function (dat, outliers) {
  DT <- as.data.table(dat)
  d_vals <- as.data.table(outliers)
  DT[ , Outliers := d_vals[DT, on = .(outliers >= X1, outliers <= X2), .N, by = .EACHI]$N]
  DT
}
## findInterval() + tabulate() method, with optional discrepancy fix
method4 <- function (dat, outliers, fix = FALSE) {
  vec <- c(with(dat, rbind(X1, X2)))
  if (is.unsorted(vec, strictly = TRUE)) stop("cannot apply this method!")
  x <- findInterval(outliers, vec, rightmost.closed = TRUE)
  x <- x[x %% 2 == 1]
  x <- (x   1) / 2
  out <- tabulate(x, nrow(dat))
  if (fix) {
    i <- which(dat$X2 %in% outliers)
    if (anyDuplicated(outliers)) {
      m <- colSums(outer(outliers, dat$X2[i], "=="))
    } else {
      m <- 1L
    }
    out[i] <- out[i] + m
  }
  dat$Outliers <- out
  dat
}
library(microbenchmark)
microbenchmark(method0 = method0(dat, outliers),
               method1 = method1(dat, outliers),
               method2 = method2(dat, outliers),
               method3 = method3(dat, outliers),
               method4F = method4(dat, outliers, fix = FALSE),
               method4T = method4(dat, outliers, fix = TRUE))
#Unit: microseconds
#     expr       min         lq       mean    median        uq       max neval
#  method0 40459.251 40854.4165 44121.8656 41015.071 46436.530 171540.17   100
#  method1 20755.578 20905.8405 21740.9797 21032.032 21309.445  30860.18   100
#  method2 19320.867 19483.9885 20452.9461 19585.355 19808.503  26417.01   100
#  method3  4920.958  5171.5260  5411.6465  5340.284  5379.681  12849.05   100
# method4F   160.382   180.7245   260.6430   217.843   228.279   5548.48   100
# method4T   331.903   353.5260   612.7073   388.385   410.383  23536.62   100

So the data.table method (method3) is the fastest solution at the coding level, about 7 ~ 8 times faster than your current method. But if the intervals do not overlap, the improved algorithm gives a 100x speedup! The discrepancy fix is not free, but it is still fast.


hint

If you do have overlapping intervals, the findInterval + tabulate method can still be applied; see the sketch after this illustration. The key is to work with non-overlapping intervals first, then aggregate the counts. Here is a conceptual illustration. Suppose we have two overlapping intervals:

[1, 3], [2, 5]

We can first separate them to non-overlapping ones:

[1, 2), [2, 3), [3, 5]

Now, findInterval + tabulate can find counts on these intervals:

[1, 2)   [2, 3)   [3, 5]
  N1       N2       N3

The desired result is an aggregation:

 [1, 3]     [2, 5]
N1 + N2    N2 + N3
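In code, a possible sketch of this decompose-then-aggregate idea looks like the following (my own illustration, not from the benchmark above; it uses half-open elementary segments, so outliers equal to an endValue would again need the kind of fix discussed earlier):

count_in_overlapping <- function(X1, X2, outliers) {
  brk <- sort(unique(c(X1, X2)))             ## breakpoints of the elementary segments
  seg <- findInterval(outliers, brk)         ## elementary segment hit by each outlier
  n_seg <- tabulate(seg, length(brk) - 1L)   ## outlier count per elementary segment
  csum <- c(0L, cumsum(n_seg))               ## prefix sums over the segments
  ## the count for [X1[k], X2[k]) is the sum over the elementary segments it covers
  csum[match(X2, brk)] - csum[match(X1, brk)]
}

count_in_overlapping(X1 = c(1, 2), X2 = c(3, 5), outliers = c(1, 2, 2, 4))
#[1] 3 3

The row loop is gone; the heavy work is a single sort plus a single findInterval call, so the same kind of speedup should carry over to overlapping intervals as well.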

Remember: Nothing is faster than using a clever algorithm.
