Home > database >  Correlate two variables between two dataframe with different sample size
Correlate two variables between two dataframe with different sample size

Time:12-25

I would like to make a correlation between two dataframes.

Let me show an example: Here are my two dataframes

set.seed(123) 
v1<- c(rep("a", 4), rep("b", 3))
v2<- c(rnorm(4), rnorm(3))

df1 <- data.frame(v1, v2)

v1<- c(rep("a", 3), rep("b", 5))
v2<- c(rnorm(3), rnorm(5))

df2 <- data.frame(v1, v2)

I would like to correlate "a" modalities between df1 and df2 and do the same for "b" and so on. I could do it manually but I have hundreds of different modalities in my variable v1.

Another issue is sometimes I have different sizes between both dataframes.

For instance, in df1 we have "a" repeated four times while in df2 it is repeated three times.

I would like to correlate always based on the dataframe having the minimum modality size and discard the other on the dataframe having more like in this stackoverflow example here Let me know if you need more details

CodePudding user response:

To calculate the correlation between the "a" modalities in df1 and df2, you can first subset the data for each dataframe by the values in the v1 column:

df1_a <- df1[df1$v1 == "a", ]
df2_a <- df2[df2$v1 == "a", ]

Then, you can calculate the correlation using the cor() function, which calculates the Pearson correlation coefficient between two variables:

cor(df1_a$v2, df2_a$v2)

If the sizes of the data frames are different, you can take the minimum of the two sizes and use that to subset the data before calculating the correlation:

n <- min(nrow(df1_a), nrow(df2_a))
cor(df1_a$v2[1:n], df2_a$v2[1:n])

To do this for all modalities in the v1 column, you can use a for loop to iterate over all unique values in the v1 column and calculate the correlations for each modality:

modalities <- unique(df1$v1)

correlations <- list()

for (modality in modalities) {
  df1_subset <- df1[df1$v1 == modality, ]
  df2_subset <- df2[df2$v1 == modality, ]
  n <- min(nrow(df1_subset), nrow(df2_subset))
  correlations[[modality]] <- cor(df1_subset$v2[1:n], df2_subset$v2[1:n])
}

correlations

This will create a list of correlations for each modality, with the modality names as the names of the list elements.

I hope this helps!

CodePudding user response:

This is one approach using dplyr

  • First attach a group to the list of both data frames.
  • Then get the group size for later filtering of row numbers in groups.
  • Finally get the correlation for the matching groups.
library(dplyr)

bind_rows(df1 %>% mutate(grp = 1), df2 %>% mutate(grp = 2)) %>% 
  group_by(v1, grp) %>% 
  mutate(size = n(), rown = row_number()) %>% 
  group_by(v1) %>% 
  mutate(minsize = min(size)) %>% 
  filter(rown <= minsize) %>% 
  summarize(corr = cor(v2[grp == 1], v2[grp == 2]))
# A tibble: 2 × 2
  v1      corr
  <chr>  <dbl>
1 a      0.819
2 b     -0.693
  • Related