R fast quantile calculation for confidence intervals-CodePudding

I am trying to calculate quantiles for every "slice" of a dataset, in order to get some kind of "confidence intervals" at a 99% level. I manage this with base R, but it is excruciatingly slow. Any idea to speed up, or for a better approach, is welcome.

a <- (1:20000)/100
b <- 20001:40000

speedseq <- data.frame(a, b)
work_quantile <- rep(NA, nrow(speedseq))
    
myfunc <- function() {
  

  for(i in 1: nrow(speedseq)) {
    work_quantile[i] <- quantile(speedseq$b[speedseq$a>=(speedseq$a[i] - 1) & 
                                                       speedseq$a<=speedseq$a[i]], na.rm = T, probs = 0.99)
    if(i%000==0) print(round(i/nrow(speedseq),3))
  }
  mean(is.na(work_quantile))

}

microbenchmark::microbenchmark(myfunc(), times = 1)
   Unit: seconds
     expr      min       lq     mean   median       uq      max neval
 myfunc() 5.185645 5.185645 5.185645 5.185645 5.185645 5.185645     1

CodePudding user response：

You could parallelize it.

library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterExport(cl, 'speedseq')

r0 <- parSapply(cl, 1:nrow(speedseq), \(i) unname(quantile(speedseq$b[speedseq$a >= (speedseq$a[i] - 1) & 
                                                    speedseq$a <= speedseq$a[i]], na.rm=T, probs=0.99)))
stopifnot(all.equal(work_quantile, r0))

stopCluster(cl)

Other approaches:

If you want slices of 1, 1:2, 1:3, ... 1:nrow, you could do

r1 <- vapply(1:nrow(speedseq), \(x) quantile(speedseq$b[seq.int(1, x)], .99), numeric(1))

If you want 1:100, 2:101, 2:102, ..., (nrow - 99):nrow, you could do

r2 <- vapply(1:(nrow(speedseq) - 99), \(x) quantile(speedseq$b[seq.int(x, x   99)], .99), numeric(1))

If you want just slices of 100 each you could do

r3 <- vapply(seq(1, nrow(speedseq), 100), \(x) quantile(speedseq$b[seq.int(x, x   99)], .99), numeric(1))

CodePudding user response：

An approach using data.table which uses the data.table non-equi join: (not sure how memory efficient it will be but it took 4 seconds to run locally)

For each row, pre-define the upper bound, lower bound and row id:

library(data.table)
setDT(speedseq)

speedseq[, upper_bound := a]
speedseq[, lower_bound := a - 1]
speedseq[, row_id := .I]

Do a non-equi on itself. Read as:

In the first line, let's call the speedseq on the right i, and the speedseq on the left X
This is your data.table merge syntax which means "for each row in i, find the rows that match from X
The match condition is that X.a >= i.lower_bound and X.a <= i.upper_bound. Again, taking the lower_bound and upper_bound from one row of i, this is saying "give me all rows in X where a is between these bounds
The .(row_id = i.row_id, a = i.a, b) specifies what I want to keep, meaning I want to keep the row_id and a from i, and the values of b that came from X

Lastly, compute the required quantile by row_id

  speedseq[speedseq,
    .(row_id = i.row_id, a = i.a, b),
    # Non equi join
    on = .(a >= lower_bound,
           a <= upper_bound)
  ][, .(work_quantile = quantile(b, na.rm = T, probs = 0.99)), by = .(row_id)]