I am trying to calculate quantiles for every "slice" of a dataset, in order to get some kind of "confidence intervals" at a 99% level. I manage this with base R, but it is excruciatingly slow. Any idea to speed up, or for a better approach, is welcome.
a <- (1:20000)/100
b <- 20001:40000
speedseq <- data.frame(a, b)
work_quantile <- rep(NA, nrow(speedseq))
myfunc <- function() {
for(i in 1: nrow(speedseq)) {
work_quantile[i] <- quantile(speedseq$b[speedseq$a>=(speedseq$a[i] - 1) &
speedseq$a<=speedseq$a[i]], na.rm = T, probs = 0.99)
if(i%000==0) print(round(i/nrow(speedseq),3))
}
mean(is.na(work_quantile))
}
microbenchmark::microbenchmark(myfunc(), times = 1)
Unit: seconds
expr min lq mean median uq max neval
myfunc() 5.185645 5.185645 5.185645 5.185645 5.185645 5.185645 1
CodePudding user response:
You could parallelize it.
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, 'speedseq')
r0 <- parSapply(cl, 1:nrow(speedseq), \(i) unname(quantile(speedseq$b[speedseq$a >= (speedseq$a[i] - 1) &
speedseq$a <= speedseq$a[i]], na.rm=T, probs=0.99)))
stopifnot(all.equal(work_quantile, r0))
stopCluster(cl)
Other approaches:
If you want slices of 1, 1:2, 1:3, ... 1:nrow, you could do
r1 <- vapply(1:nrow(speedseq), \(x) quantile(speedseq$b[seq.int(1, x)], .99), numeric(1))
If you want 1:100, 2:101, 2:102, ..., (nrow - 99):nrow, you could do
r2 <- vapply(1:(nrow(speedseq) - 99), \(x) quantile(speedseq$b[seq.int(x, x 99)], .99), numeric(1))
If you want just slices of 100 each you could do
r3 <- vapply(seq(1, nrow(speedseq), 100), \(x) quantile(speedseq$b[seq.int(x, x 99)], .99), numeric(1))
CodePudding user response:
An approach using data.table
which uses the data.table
non-equi join: (not sure how memory efficient it will be but it took 4 seconds to run locally)
For each row, pre-define the upper bound, lower bound and row id:
library(data.table)
setDT(speedseq)
speedseq[, upper_bound := a]
speedseq[, lower_bound := a - 1]
speedseq[, row_id := .I]
Do a non-equi on itself. Read as:
In the first line, let's call the
speedseq
on the righti
, and thespeedseq
on the leftX
This is your data.table merge syntax which means "for each row in
i
, find the rows that match fromX
The match condition is that
X.a >= i.lower_bound
andX.a <= i.upper_bound
. Again, taking thelower_bound
andupper_bound
from one row ofi
, this is saying "give me all rows in X wherea
is between these boundsThe
.(row_id = i.row_id, a = i.a, b)
specifies what I want to keep, meaning I want to keep therow_id
anda
fromi
, and the values ofb
that came fromX
Lastly, compute the required quantile by
row_id
speedseq[speedseq, .(row_id = i.row_id, a = i.a, b), # Non equi join on = .(a >= lower_bound, a <= upper_bound) ][, .(work_quantile = quantile(b, na.rm = T, probs = 0.99)), by = .(row_id)]