I have a dataframe with following structure:
df <-
data.frame(
Replicate = c(rep("N1", 50), rep("N2", 50)),
feature1 = rnorm(100, 0, 1),
feature2 = rnorm(100, 0, 3),
feature3 = rnorm(100, 0.1, 1)
)
I am calculating the correlation between my (biological) replicates for each of my data columns (here "feature 1-3") with following code:
results_table <- data.frame(feature = NA,
correlation = NA)
for(i in colnames(df)[2:4]){
cor_i <- cor(df %>% filter(Replicate == "N1") %>% pull(i),
df %>% filter(Replicate == "N2") %>% pull(i),
use = "pairwise.complete")
results_table_temp <- data.frame(feature = i,
correlation = cor_i)
results_table <- rbind(results_table, results_table_temp)
}
results_table <- results_table[2:nrow(results_table),]
results_table
I basically filter my initial dataframe for the respective replicate and calculate correlation between these replicate for each column (using a a for loop with cor()
and store the output in dataframe).
For my dataset (240 rows with >7000 colums), the computing time is quite long! Is there a more efficient way to calculate this? Maybe a specific function or preprocessing of data to make the computation more efficient?
CodePudding user response:
There are a few ways to do this:
Here is a simple way using summarize(across())
from dplyr
df %>%
summarize(across(
-Replicate,
~cor(.x[Replicate=="N1"], .x[Replicate=="N2"], use ="pairwise.complete")
)) %>%
t()
Output:
[,1]
feature1 -0.035869831
feature2 -0.007740304
feature3 -0.051250907
However, because it uses dplyr
, it is unlikely to be very fast on your 7000 column dataset. At 7000x240, this is only ~1.7 million rows. Better to swing that table long, and group by feature.. Of course, use data.table for this, not dplyr. Here is an example with 7000 features
melt(dt, id="Replicate",variable.name = "feature")[
, .(correlation = cor(value[Replicate=="N1"], value[Replicate=="N2"], use="pairwise.complete")),
by = feature]
Output:
feature correlation
<fctr> <num>
1: V1 0.130108203
2: V2 -0.060648735
3: V3 0.109366966
4: V4 -0.279476904
5: V5 0.007624332
---
6996: V6996 -0.112960673
6997: V6997 0.022128826
6998: V6998 -0.041648409
6999: V6999 -0.052700939
7000: V7000 -0.194483569
Input:
set.seed(123)
dt = lapply(1:7000, \(x) rnorm(n=240)) %>% as.data.table()
dt[,Replicate:=c(rep("N1", 120), rep("N2", 120))]