I understand how to boostrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack, Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.
However, this is for small datasets ( 2 variables or a matrix with 5 variables). I have a very large matrix (1000 columns) and the code I use to compute the correlation between every metabolite pair (removing duplicate and correlations with the metabolite itself) is:
x <- colnames(dat)
GetCor = function(x,y) cor(dat[,x], dat[,y], method="spearman")
GetCor = Vectorize(GetCor)
out <- data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
mutate(v = GetCor(X1,X2))
I'm not sure how I can then alter this for it to be the function I pass to statistic in boot so
boot_res<- boot(dat, ?, R=1000)
Or would I just need to obtain a matrix of the bootstrapped p value or estimate depending on function code (colMeans(boot_res$t)) and get rid of the upper or lower triangle?
Was curious to know the most efficient way of going about the problem..
CodePudding user response:
Something like this? It follows more or less the same lines as my answer to the 2nd question you link to in your question.
Note that I have simplified the correlation code, cor
accepts a data.frame or a matrix, so pass a two column one and keep one of the off diagonal correlation matrix elements.
library(boot)
bootPairwiseCor <- function(data, i) {
d <- data[i,]
combn(d, 2, \(x) cor(x, method="spearman")[1,2])
}
dat <- iris[-5]
nms <- combn(colnames(dat), 2, paste, collapse = "_")
R <- 100L
b <- boot(dat, bootPairwiseCor, R)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = dat, statistic = bootPairwiseCor, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.1667777 0.0037142908 0.070552718
#> t2* 0.8818981 -0.0002851683 0.017783297
#> t3* 0.8342888 0.0006306610 0.021509280
#> t4* -0.3096351 0.0047809612 0.075976067
#> t5* -0.2890317 0.0045689001 0.069929108
#> t6* 0.9376668 -0.0014838117 0.009632318
data.frame(variables = nms, correlations = colMeans(b$t))
#> variables correlations
#> 1 Sepal.Length_Sepal.Width -0.1630634
#> 2 Sepal.Length_Petal.Length 0.8816130
#> 3 Sepal.Length_Petal.Width 0.8349194
#> 4 Sepal.Width_Petal.Length -0.3048541
#> 5 Sepal.Width_Petal.Width -0.2844628
#> 6 Petal.Length_Petal.Width 0.9361830
Created on 2023-01-28 with reprex v2.0.2
CodePudding user response:
You may want to use cor.test
to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)
Here is a similar function to yours, but applying cor.test
and extracting statistics.
corr_cmb <- \(X, boot=FALSE) {
stts <- c('estimate', 'statistic', 'p.value')
cmbn <- combn(colnames(X), 2, simplify=FALSE)
a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |>
do.call(what=rbind) |>
`rownames<-`(sapply(cmbn, paste, collapse=':'))
if (boot) {
a <- a[, 'statistic']
}
a
}
We run it one times on the data to get a theoretical solution.
rhat <- corr_cmb(dat)
head(rhat, 3)
# estimate statistic p.value
# V1:V2 0.06780426 2.1469547 0.03203729
# V1:V3 0.03471587 1.0973752 0.27274212
# V1:V4 0.05301563 1.6771828 0.09381987
Bootstrap
We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2)
returns 499500 combinations). That's why we think about a multithreaded solution right away.
To bootstrap we simple repeatedly apply corr_cmb
repeatedly on a sample
of the data with replications.
We will measure the time to estimate the time needed for 1000 variables.
## setup clusters
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('corr_cmb', 'dat'))
t0 <- Sys.time() ## timestamp before run
B <- 1099L
clusterSetRNGStream(CL, 42)
boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
t1 <- Sys.time() ## timestamp after run
stopCluster(CL)
After the bootstrap, we calculate the ratios of how many times the abs
olute bootstrap test statistics exceeded the theoretical ones (Ref.),
boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))
and cbind
the bootstrap p-values to the theoretical result.
cbind(rhat, boot_p)
# estimate statistic p.value boot_p
# V1:V2 0.06780426 2.1469547 0.03203729 0.03003003
# V1:V3 0.03471587 1.0973752 0.27274212 0.28028028
# V1:V4 0.05301563 1.6771828 0.09381987 0.08208208
# V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
# V2:V3 0.03730133 1.1792122 0.23859474 0.23323323
# V2:V4 0.07203911 2.2817257 0.02271539 0.01201201
# V2:V5 0.03098230 0.9792363 0.32770055 0.30530531
# V3:V4 0.02364486 0.7471768 0.45513283 0.47547548
# V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
# V4:V5 0.03415689 1.0796851 0.28054328 0.29329329
Note that the data used is fairly normally distributed. If the data is not normally distributed, the bootstrap p-values will be more different.
To conclude, an estimate of the time needed for your 1000 variables.
d <- as.numeric(difftime(t1, t0, units='mins'))
n_est <- 1000
t_est <- d/(choose(m, 2))*choose(n_est, 2)
cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
# est. runtime for 1000 variables: 1485.8 mins
(Perhaps for sake of completeness, a single-threaded version for smaller problems:)
## singlethreaded version
# set.seed(42)
# B <- 1099L
# boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
Data:
library(MASS)
n <- 1e3; m <- 5
Sigma <- matrix(.5, m, m)
diag(Sigma) <- 1
set.seed(42)
M <- mvrnorm(n, runif(m), Sigma)
M <- M rnorm(length(M), sd=6)
dat <- as.data.frame(M)