Optimizing correlation calculation between (biological) replicates in R


I have a dataframe with following structure:

df <- 
  data.frame(
    Replicate = c(rep("N1", 50), rep("N2", 50)),
    feature1 = rnorm(100, 0, 1),
    feature2 = rnorm(100, 0, 3),
    feature3 = rnorm(100, 0.1, 1)
  )

I am calculating the correlation between my (biological) replicates for each of my data columns (here feature1-3) with the following code:

library(dplyr)

# start with a placeholder row that is dropped after the loop
results_table <- data.frame(feature = NA,
                            correlation = NA)

# for each feature column, correlate the N1 values against the N2 values
for(i in colnames(df)[2:4]){
  cor_i <- cor(df %>% filter(Replicate == "N1") %>% pull(i), 
               df %>% filter(Replicate == "N2") %>% pull(i), 
               use = "pairwise.complete")
  
  results_table_temp <- data.frame(feature = i,
                                   correlation = cor_i)
  
  results_table <- rbind(results_table, results_table_temp)
}

results_table <- results_table[2:nrow(results_table),]
results_table

I basically filter my initial dataframe for the respective replicate and calculate the correlation between these replicates for each column (using a for loop with cor()) and store the output in a dataframe.

For my dataset (240 rows with >7000 columns), the computing time is quite long! Is there a more efficient way to calculate this? Maybe a specific function or some preprocessing of the data to make the computation more efficient?

CodePudding user response:

There are a few ways to do this:

Here is a simple way using summarize(across()) from dplyr

df %>% 
  summarize(across(
    -Replicate,
    ~cor(.x[Replicate=="N1"], .x[Replicate=="N2"], use ="pairwise.complete")
  )) %>% 
  t()

Output:

                 [,1]
feature1 -0.035869831
feature2 -0.007740304
feature3 -0.051250907
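
If you would rather have a tidy two-column data frame than a transposed matrix, a minimal variation of the same idea (assuming tidyr is also available for pivot_longer) is:

library(dplyr)
library(tidyr)

df %>% 
  summarize(across(
    -Replicate,
    ~cor(.x[Replicate=="N1"], .x[Replicate=="N2"], use = "pairwise.complete")
  )) %>% 
  # reshape the 1-row summary into one row per feature
  pivot_longer(everything(), names_to = "feature", values_to = "correlation")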

However, because it uses dplyr, it is unlikely to be very fast on your 7000-column dataset. At 7000 x 240, this is only ~1.7 million values. Better to reshape that table to long format and group by feature. Of course, use data.table for this, not dplyr. Here is an example with 7000 features:

library(data.table)

# melt to long format (one row per feature/value), then correlate N1 vs N2 within each feature
melt(dt, id="Replicate", variable.name = "feature")[
  , .(correlation = cor(value[Replicate=="N1"], value[Replicate=="N2"], use="pairwise.complete")),
  by = feature]

Output:

      feature  correlation
       <fctr>        <num>
   1:      V1  0.130108203
   2:      V2 -0.060648735
   3:      V3  0.109366966
   4:      V4 -0.279476904
   5:      V5  0.007624332
  ---                     
6996:   V6996 -0.112960673
6997:   V6997  0.022128826
6998:   V6998 -0.041648409
6999:   V6999 -0.052700939
7000:   V7000 -0.194483569

Input:

set.seed(123)
# 7000 feature columns, 240 rows (the %>% pipe comes from dplyr/magrittr, loaded above)
dt = lapply(1:7000, \(x) rnorm(n = 240)) %>% as.data.table()
dt[, Replicate := c(rep("N1", 120), rep("N2", 120))]
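
If your data contain no missing values, a fully vectorized base-R sketch is another option: split the features into an N1 matrix and an N2 matrix, let scale() standardize each column, and the per-feature correlation reduces to a column-wise cross-product. Like the loop above, this assumes the N1 and N2 rows are already in matching order (shown here on the small df example; the same pattern applies to the wide table):

# base-R sketch, assuming no NAs and pairwise-aligned N1/N2 rows
X <- scale(as.matrix(df[df$Replicate == "N1", -1]))   # standardized N1 columns
Y <- scale(as.matrix(df[df$Replicate == "N2", -1]))   # standardized N2 columns
correlations <- colSums(X * Y) / (nrow(X) - 1)        # per-feature Pearson correlation
correlations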