How to optimize my correlation problem in R?


I have three data frames in R; let's call them A, B, and C. Data frame C contains two columns: the first contains row names from data frame A and the second contains row names from data frame B:

C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
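For concreteness, toy versions of A and B might look like this (hypothetical stand-ins; the real data frames are much larger, and only the row names and the fact that each row is a numeric vector matter here):

# toy stand-ins for the real (much larger) data frames A and B
set.seed(1)
A <- data.frame(matrix(rnorm(12 * 10), nrow = 12, dimnames = list(paste0("a", 1:12), NULL)))
B <- data.frame(matrix(rnorm(54 * 10), nrow = 54, dimnames = list(paste0("b", 1:54), NULL)))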

I want to calculate the correlation coefficient and p-value for each row of C, using the values from the corresponding rows of A and B (i.e. correlating the values from row a12 of A with the values from row b6 of B, row a9 of A with row b54 of B, etc.), and put the results in additional columns of C. This is my current naive and highly inefficient code:

for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  C[i, 3] <- correlation$estimate
  C[i, 4] <- correlation$p.value
}

The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish the task. I also tried the following code using the "Hmisc" package, but the server I'm working on can't handle the large vectors:

library(Hmisc)

# line up the columns of the transposed matrices with the pairs listed in C
A <- t(A)
B <- t(B)
ind.A <- match(C[, 1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[, 2], colnames(B))
B <- B[, ind.B]

# rcorr() correlates every column of cbind(A, B) with every other column;
# only the diagonal of the A-vs-B block is needed
nA <- ncol(A)
res <- rcorr(as.matrix(A), as.matrix(B), type = "spearman")
C[, 3] <- diag(res$r[1:nA, (nA + 1):(2 * nA)])
C[, 4] <- diag(res$P[1:nA, (nA + 1):(2 * nA)])

CodePudding user response:

Based on the comment by @HYENA, I tried parallelizing the processing. This approach accelerated the process approximately 4x (with 8 cores). The code:

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# correlate each pair of rows listed in C on a separate worker
cor.res <- foreach(i = 1:nrow(C)) %dopar% {
  a <- C[i, 1]
  b <- C[i, 2]
  correlation <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}

cor.res <- data.frame(Reduce("rbind", cor.res))
C[, c(3, 4)] <- cor.res
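
As a small follow-up (a minor variant of the loop above, not a different approach): foreach can row-bind the per-iteration results itself via its .combine argument, which makes the Reduce step unnecessary, and stopCluster releases the workers once the loop is done:

# same loop, letting foreach stack the results into a matrix
cor.res <- foreach(i = 1:nrow(C), .combine = rbind) %dopar% {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
C[, c(3, 4)] <- data.frame(cor.res)

# release the worker processes when finished
stopCluster(cl)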

CodePudding user response:

Extract just the part you need from cor.test, giving cor_test1, and use that instead; or, in addition, create a lookup table for the p values, giving cor_test2, which is slightly faster than cor_test1.

Based on the median column, with vectors of length 10 these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, it is included since the speed advantage may depend on the size of the input, which we don't have; you can try it out yourself with whatever sizes you have.

# given a correlation and the degrees of freedom, output a p value
# (this is the smaller tail of the t distribution; a two-sided p value would be twice this)
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}

# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r/sqrt(1 - r^2)
  pval <- min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}

# even faster version of cor.test:
# given x, y and the pvals table, calculate a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  # look the p value up in the precomputed table (entry k holds the p value for r = (k - 101)/100)
  c(r, pvals[100 * round(r, 2) + 101])
}

# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)

library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))

giving:

Unit: microseconds
                   expr   min    lq    mean median     uq     max neval cld
         cor.test(x, y) 253.7 256.7 346.278 266.05 501.45   650.6   100   a
        cor_test1(x, y)  84.8  87.2 346.777  89.10 107.40 22974.4   100   a
 cor_test2(x, y, pvals)  72.4  75.0 272.030  79.45  91.25 17935.8   100   a
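
To tie this back to the original row-by-row problem, here is a minimal sketch building on cor_test1 (the rank_rows helper and the A_rk/B_rk names are just for this sketch). Since the question uses Spearman correlation, each row of A and B is rank-transformed once up front, because Spearman's rho is the Pearson correlation of the ranks; the p values are then the t-approximation computed by cor_test1, which will differ somewhat from cor.test's Spearman p values.

# rank each row once so that plain cor() on the ranks gives Spearman's rho
rank_rows <- function(m) t(apply(m, 1, rank))
A_rk <- rank_rows(as.matrix(A))
B_rk <- rank_rows(as.matrix(B))

# apply the fast function to every pair of row names listed in C
res <- mapply(function(a, b) cor_test1(A_rk[a, ], B_rk[b, ]),
              as.character(C[, 1]), as.character(C[, 2]))
C[, 3] <- res[1, ]  # rho
C[, 4] <- res[2, ]  # p value as computed by cor_test1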