Home > database >  Spearman correlation from matrix
Spearman correlation from matrix

Time:10-19

I have a data frame that looks like this:

Samples   GENE GEN1  GEN2 GEN3   GEN4 GEN5
Sample1     21.0  160 110 3.90 2.62 16.5
Sample2  21.0  160 110 3.90 2.88 17.0
Sample3    22.8  108  5 3.85 2.32 18.6

What I pretend is to perform a spearman correlation using the column GENE vs each other columns, ie GENEvsGEN1, GENEvsGEN2, etc. And obtain a table with the rho and p-values of each column (the results are invented):

             rho     p-value
GENEvsGEN1    0.01193936     0.34
GENEvsGEN2 0.0113436     0.034

I know I can obtain it individually using:

res2 <-cor.test(data$GENE, data$GEN1,  method = "spearman")
res2

But I have almost 5000 columns, so doing manually each one is not viable.

Any suggestions about how I can handle it? Thanks!!

CodePudding user response:

An option could be using the corr.test function from the psych package which makes it possible to have a matrix as input to compute the test with one column with all others like this:

# Remove non numeric column
df = df[,-1]
# Correlation test
library(psych)
corr.test(df[, colnames(df) != "GENE"], df$GENE, method = 'spearman')
#> Warning in corr.test(df[, colnames(df) != "GENE"], df$GENE, method =
#> "spearman"): Number of subjects must be greater than 3 to find confidence
#> intervals.
#> Call:corr.test(x = df[, colnames(df) != "GENE"], y = df$GENE, method = "spearman")
#> Correlation matrix 
#>       [,1]
#> GEN1 -1.00
#> GEN2 -1.00
#> GEN3 -1.00
#> GEN4 -0.87
#> GEN5  0.87
#> Sample Size 
#> [1] 3
#> These are the unadjusted probability values.
#>   The probability values  adjusted for multiple tests are in the p.adj object. 
#>      [,1]
#> GEN1 0.00
#> GEN2 0.00
#> GEN3 0.00
#> GEN4 0.33
#> GEN5 0.33
#> 
#>  To see confidence intervals of the correlations, print with the short=FALSE option

Created on 2022-10-19 with reprex v2.0.2

CodePudding user response:

You could use a for loop in order to store each comparison, as follows

### Initiating data
data <- data.frame(GENE=rnorm(100),
                   GEN1=rnorm(100),
                   GEN2=rnorm(100),
                   GEN3=rnorm(100))

### Initiating empty dataframe for storing results
matCorrSpearman <- data.frame(gene=paste("GENE vs ", colnames(data)[-1]), rho=rep(NA, ncol(data)-1), pvalue=NA)

### Storing results through a loop
for(i in 2:ncol(data)){
  res <- cor.test(data[, "GENE"], data[, i], method="spearman")
  matCorrSpearman$rho[i-1] <- res$estimate
  matCorrSpearman$pvalue[i-1] <- res$p.value
}

### Display results
matCorrSpearman

Results

           gene        rho    pvalue
1 GENE vs  GEN1 0.10882688 0.2806195
2 GENE vs  GEN2 0.05983798 0.5536572
3 GENE vs  GEN3 0.15259526 0.1294692

CodePudding user response:

Here is one potential solution using the expand.grid() function to generate the different combination of variables (e.g. GENE,GEN1 & GENE,GEN2 etc), then using the pmap() function from the purrr package to evaluate each of the combinations. The cor.test function requires two numeric vectors as input, so you can use pull() to extract the 'variables of interest' from the dataframe and pass them to the function. Then, the tidy() function from the broom package is used to format the output, and the required columns are selected.

library(tidyverse)
library(broom)

df <- read.table(text = "Samples   GENE GEN1  GEN2 GEN3   GEN4 GEN5
Sample1     21.0  160 110 3.90 2.62 16.5
Sample2  21.0  160 110 3.90 2.88 17.0
Sample3    22.8  108  5 3.85 2.32 18.6",
header = TRUE)

groupings <- expand.grid("GENE", names(df)[-c(1,2)])
groupings
#>   Var1 Var2
#> 1 GENE GEN1
#> 2 GENE GEN2
#> 3 GENE GEN3
#> 4 GENE GEN4
#> 5 GENE GEN5

# cor.test requires two numeric vectors,
# so a workaround with pmap() is to use pull()
result <- pmap(groupings, ~cor.test(df %>% pull(.x),
                                    df %>% pull(.y),
                                    method = "spearman")) %>% 
  map_df(tidy) %>%
  cbind(groupings, .) %>%
  rename("rho" = "estimate") %>%
  select(Var1, Var2, rho, p.value)

result
#>   Var1 Var2        rho   p.value
#> 1 GENE GEN1 -1.0000000 0.0000000
#> 2 GENE GEN2 -1.0000000 0.0000000
#> 3 GENE GEN3 -1.0000000 0.0000000
#> 4 GENE GEN4 -0.8660254 0.3333333
#> 5 GENE GEN5  0.8660254 0.3333333

Created on 2022-10-19 by the reprex package (v2.0.1)

  • Related