I have a data frame with startValue (in column X1
) and endValue (in column X2
), and a list of outliers. I need to count how many outliers fall within those specific startValue and endValue. Here is a minimal example and my attempt:
library("dplyr")
set.seed(2022)
dat = data.frame(matrix(sort(sample(1:500, 50)), ncol = 2, byrow = T))
dim(dat)
dat$Outliers = NA
outliers = sort(sample(1:500, 30))
for (i in 1:25){
dat$Outliers[i] = length(which(between(outliers, dat$X1[i], dat$X2[i]) == T))
}
This code works fine. But my real data have millions of row and this for
loop takes a lot of time. Is there a faster way to solve this problem?
CodePudding user response:
a shallow fix at coding level
You will see an immediate speedup, with the following modification:
out <- integer(nrow(dat))
for (i in 1:nrow(dat)) {
out[i] <- sum(between(outliers, dat$X1[i], dat$X2[i]))
}
dat$Outliers1 <- out
with(dat, identical(Outliers, Outliers1)) ## the result is correct
#[1] TRUE
This avoids modifying a data.frame during a loop, hence a lot of unnecessary memory copies (which is particularly important if you have a big data.frame).
length(which(LogicalVec == TRUE))
is awkward; usesum(LogicalVec)
.
You can also try replacing this loop using:
dat$Outliers2 <- mapply(\(a, b) sum(between(outliers, a, b)), dat$X1, dat$X2)
with(dat, identical(Outliers, Outliers2)) ## the result is correct
#[1] TRUE
Henrik mentioned a data.table solution:
library(data.table)
setDT(dat)
d_vals <- as.data.table(outliers)
dat[ , Outliers3 := d_vals[dat, on = .(outliers >= X1, outliers <= X2), .N, by = .EACHI]$N]
with(dat, identical(Outliers, Outliers3))
#[1] TRUE
a deep fix at algorithm level
If your real data are like what you have simulated, with non-overlapping intervals [startValue, endValue]
, you can use base::findInterval
rather than dplyr::between
. The only problem, is that findInterval
uses interval [a, b) while between
uses [a, b]. So if an outlier value happens to equal an endValue, between
will count it twice, whereas findInterval
only count it once. This does occur for you minimal example.
vec <- c(with(dat, rbind(X1, X2)))
if (is.unsorted(vec, strictly = TRUE)) stop("cannot apply this method!")
x <- findInterval(outliers, vec, rightmost.closed = TRUE)
x <- x[x %% 2 == 1] ## get rid of X2[i] ~ X1[i 1]; just keep X1[i] ~ X2[i]
x <- (x 1) / 2
dat$Outliers4 <- tabulate(x, nrow(dat))
with(dat, identical(Outliers, Outliers4)) ## oops, the results are not identical
#[1] FALSE
## you see, the first interval is causing problem
with(dat, Outliers - Outliers4)
#[1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## inspect the first interval
dat[1, 1:2]
# X1 X2
#1 1 3
## inspect outliers: 3 is an outlier value
## it is not in [1, 3), so not counted by `findInterval`
## but it is in [1, 3], so counted by `between`
outliers
# [1] 1 3 19 32 42 56 87 99 120 218 234 248 276 287 306 316 319 336 351
#[20] 353 361 382 389 400 417 423 425 458 467 480
Well, depending on what you want in the end, you may or may not fix this discrepancy. If you do want to fix, you could do:
i <- which(dat$X2 %in% outliers)
dat$Outliers4[i] <- dat$Outliers4[i] 1L
with(dat, identical(Outliers, Outliers4))
#[1] TRUE
Hang on, this assumes that your outliers
have no duplicated values. If some value appears more than once, you need to plus the number of duplicates instead of 1. You need:
i <- which(dat$X2 %in% outliers)
m <- colSums(matrix(outliers %in% dat$X2[i], ncol = length(i)))
dat$Outliers4[i] <- dat$Outliers4[i] m
benchmark
## a 1000 x 2 data.frame
set.seed(2022)
dat = data.frame(matrix(sort(sample(20000, 2000)), ncol = 2, byrow = T))
outliers = sort(sample(20000, 1000))
## your current method
method0 <- function (dat, outliers) {
dat$Outliers = NA
for (i in 1:nrow(dat)) {
dat$Outliers[i] = length(which(between(outliers, dat$X1[i], dat$X2[i]) == T))
}
dat
}
## modified `for`-loop method
method1 <- function (dat, outliers) {
out <- integer(nrow(dat))
for (i in 1:nrow(dat)) {
out[i] <- sum(between(outliers, dat$X1[i], dat$X2[i]))
}
dat$Outliers <- out
dat
}
## `mapply` method
method2 <- function (dat, outliers) {
dat$Outliers <- mapply(\(a, b) sum(between(outliers, a, b)), dat$X1, dat$X2)
dat
}
## "data.table" method
method3 <- function (dat, outliers) {
DT <- as.data.table(dat)
d_vals <- as.data.table(outliers)
DT[ , Outliers := d_vals[DT, on = .(outliers >= X1, outliers <= X2), .N, by = .EACHI]$N]
DT
}
## fineInterval() tabulate(), with optional discrepancy fix
method4 <- function (dat, outliers, fix = FALSE) {
vec <- c(with(dat, rbind(X1, X2)))
if (is.unsorted(vec, strictly = TRUE)) stop("cannot apply this method!")
x <- findInterval(outliers, vec, rightmost.closed = TRUE)
x <- x[x %% 2 == 1]
x <- (x 1) / 2
out <- tabulate(x, nrow(dat))
if (fix) {
i <- which(dat$X2 %in% outliers)
if (anyDuplicated(outliers)) {
m <- colSums(matrix(outliers %in% dat$X2[i], ncol = length(i)))
} else {
m <- 1L
}
out[i] <- out[i] m
}
dat$Outliers <- out
dat
}
library(microbenchmark)
microbenchmark(method0 = method0(dat, outliers),
method1 = method1(dat, outliers),
method2 = method2(dat, outliers),
method3 = method3(dat, outliers),
method4F = method4(dat, outliers, fix = FALSE),
method4T = method4(dat, outliers, fix = TRUE))
#Unit: microseconds
# expr min lq mean median uq max neval
# method0 40459.251 40854.4165 44121.8656 41015.071 46436.530 171540.17 100
# method1 20755.578 20905.8405 21740.9797 21032.032 21309.445 30860.18 100
# method2 19320.867 19483.9885 20452.9461 19585.355 19808.503 26417.01 100
# method3 4920.958 5171.5260 5411.6465 5340.284 5379.681 12849.05 100
# method4F 160.382 180.7245 260.6430 217.843 228.279 5548.48 100
# method4T 331.903 353.5260 612.7073 388.385 410.383 23536.62 100
So data.table is the fastest solution at coding level, about 7 ~ 8 times faster. But if intervals do not overlap, then an improved algorithm gives 100 times speedup! The discrepancy fix is not free, but is still fast.
hint
If you do have overlapping intervals, the findInterval
tabulate
method can still be applied. The key is to work with non-overlapping intervals first, then aggregate the counts. Here is a conceptual illustration. Suppose we have two overlapping intervals:
[1, 3], [2, 5]
We can first separate them to non-overlapping ones:
[1, 2), [2, 3), [3, 5]
Now, findInterval
tabulate
can find counts on these intervals:
[1, 2) | [2, 3) | [3, 5] |
---|---|---|
N1 | N2 | N3 |
The desired result is an aggregation:
[1, 3) | [2, 5] |
---|---|
N1 N2 | N2 N3 |
Remember: Nothing is faster than using a clever algorithm.