I have a data frame that looks like this:
Samples GENE GEN1 GEN2 GEN3 GEN4 GEN5
Sample1 21.0 160 110 3.90 2.62 16.5
Sample2 21.0 160 110 3.90 2.88 17.0
Sample3 22.8 108 5 3.85 2.32 18.6
What I pretend is to perform a spearman correlation using the column GENE
vs each other columns, ie GENEvsGEN1, GENEvsGEN2, etc.
And obtain a table with the rho and p-values of each column (the results are invented):
rho p-value
GENEvsGEN1 0.01193936 0.34
GENEvsGEN2 0.0113436 0.034
I know I can obtain it individually using:
res2 <-cor.test(data$GENE, data$GEN1, method = "spearman")
res2
But I have almost 5000 columns, so doing manually each one is not viable.
Any suggestions about how I can handle it? Thanks!!
CodePudding user response:
An option could be using the corr.test
function from the psych
package which makes it possible to have a matrix as input to compute the test with one column with all others like this:
# Remove non numeric column
df = df[,-1]
# Correlation test
library(psych)
corr.test(df[, colnames(df) != "GENE"], df$GENE, method = 'spearman')
#> Warning in corr.test(df[, colnames(df) != "GENE"], df$GENE, method =
#> "spearman"): Number of subjects must be greater than 3 to find confidence
#> intervals.
#> Call:corr.test(x = df[, colnames(df) != "GENE"], y = df$GENE, method = "spearman")
#> Correlation matrix
#> [,1]
#> GEN1 -1.00
#> GEN2 -1.00
#> GEN3 -1.00
#> GEN4 -0.87
#> GEN5 0.87
#> Sample Size
#> [1] 3
#> These are the unadjusted probability values.
#> The probability values adjusted for multiple tests are in the p.adj object.
#> [,1]
#> GEN1 0.00
#> GEN2 0.00
#> GEN3 0.00
#> GEN4 0.33
#> GEN5 0.33
#>
#> To see confidence intervals of the correlations, print with the short=FALSE option
Created on 2022-10-19 with reprex v2.0.2
CodePudding user response:
You could use a for
loop in order to store each comparison, as follows
### Initiating data
data <- data.frame(GENE=rnorm(100),
GEN1=rnorm(100),
GEN2=rnorm(100),
GEN3=rnorm(100))
### Initiating empty dataframe for storing results
matCorrSpearman <- data.frame(gene=paste("GENE vs ", colnames(data)[-1]), rho=rep(NA, ncol(data)-1), pvalue=NA)
### Storing results through a loop
for(i in 2:ncol(data)){
res <- cor.test(data[, "GENE"], data[, i], method="spearman")
matCorrSpearman$rho[i-1] <- res$estimate
matCorrSpearman$pvalue[i-1] <- res$p.value
}
### Display results
matCorrSpearman
Results
gene rho pvalue
1 GENE vs GEN1 0.10882688 0.2806195
2 GENE vs GEN2 0.05983798 0.5536572
3 GENE vs GEN3 0.15259526 0.1294692
CodePudding user response:
Here is one potential solution using the expand.grid()
function to generate the different combination of variables (e.g. GENE,GEN1 & GENE,GEN2 etc), then using the pmap()
function from the purrr package to evaluate each of the combinations. The cor.test function requires two numeric vectors as input, so you can use pull()
to extract the 'variables of interest' from the dataframe and pass them to the function. Then, the tidy()
function from the broom package is used to format the output, and the required columns are selected.
library(tidyverse)
library(broom)
df <- read.table(text = "Samples GENE GEN1 GEN2 GEN3 GEN4 GEN5
Sample1 21.0 160 110 3.90 2.62 16.5
Sample2 21.0 160 110 3.90 2.88 17.0
Sample3 22.8 108 5 3.85 2.32 18.6",
header = TRUE)
groupings <- expand.grid("GENE", names(df)[-c(1,2)])
groupings
#> Var1 Var2
#> 1 GENE GEN1
#> 2 GENE GEN2
#> 3 GENE GEN3
#> 4 GENE GEN4
#> 5 GENE GEN5
# cor.test requires two numeric vectors,
# so a workaround with pmap() is to use pull()
result <- pmap(groupings, ~cor.test(df %>% pull(.x),
df %>% pull(.y),
method = "spearman")) %>%
map_df(tidy) %>%
cbind(groupings, .) %>%
rename("rho" = "estimate") %>%
select(Var1, Var2, rho, p.value)
result
#> Var1 Var2 rho p.value
#> 1 GENE GEN1 -1.0000000 0.0000000
#> 2 GENE GEN2 -1.0000000 0.0000000
#> 3 GENE GEN3 -1.0000000 0.0000000
#> 4 GENE GEN4 -0.8660254 0.3333333
#> 5 GENE GEN5 0.8660254 0.3333333
Created on 2022-10-19 by the reprex package (v2.0.1)