How i can calculate the correlation between groups in R using dplyr?-CodePudding

Let's say i have data frame in R that looks like this :

var = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3))
y = rnorm(15)
data = tibble(var,y);data

With output:

# A tibble: 15 x 2
   var        y
   <chr>  <dbl>
 1 A     -1.23 
 2 A     -0.983
 3 A      1.28 
 4 B     -0.268
 5 B     -0.460
 6 B     -1.23 
 7 C      1.87 
 8 C      0.416
 9 C     -1.99 
10 D      0.289
11 D      1.70 
12 D     -0.455
13 E     -0.648
14 E      0.376
15 E     -0.887

i want to calculate the correlation of each distinct pair in R using dplyr. Ideally i want to look like this (the third column to contain the values of each correlation pair):

var1	var2	value
A	B	cor(A,B)
A	C	cor(A,C)
A	D	cor(A,D)
A	E	cor(A,E)
B	C	cor(B,E)
B	D	cor(B,E)
B	E	cor(B,E)
C	D	cor(C,E)
C	E	cor(C,E)
D	E	cor(D,E)

How i can do that in R ? Any help ?

Additional

if i have another grouping variable say group2:

var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
        rep("H",3),rep("I",3))

y2 = rnorm(24)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2);data2

which ideally must look like this :

group	var1	var2	value
1	A	B	cor(A,B)
1	A	H	cor(A,H)
1	A	I	cor(A,I)
1	B	H	cor(B,H)
1	B	I	cor(B,I)
1	H	I	cor(H,I)
2	C	D	cor(C,D)
3	E	F	cor(E,F)

How i can calculate each variable in column var2 on each group group2?

CodePudding user response：

Here is a one-liner via base R

data.frame(t(combn(unique(data$var), 2, function(i)
                     list(v1 = i[[1]], 
                          v2 = i[[2]], 
                          value = cor(data$y[data$var %in% i[[1]]], 
                                      data$y[data$var %in% i[[2]]])))))

   X1 X2         X3
1   A  B   0.997249
2   A  C  0.7544987
3   A  D -0.7924587
4   A  E 0.03567887
5   B  C  0.8010711
6   B  D -0.7450683
7   B  E  0.1096579
8   C  D -0.1976141
9   C  E  0.6828033
10  D  E  0.5812632

CodePudding user response：

Another possible solution:

library(tidyverse)

df %>% 
  group_by(var) %>% 
  group_map(~ data.frame(.x) %>% set_names(.y)) %>% 
  bind_cols %>% cor %>% 
  {data.frame(row=rownames(.)[row(.)[upper.tri(.)]], 
              col=colnames(.)[col(.)[upper.tri(.)]], 
              corr=.[upper.tri(.)])}

#>    row col       corr
#> 1    A   B -0.9949738
#> 2    A   C -0.9574502
#> 3    B   C  0.9815368
#> 4    A   D -0.7039708
#> 5    B   D  0.6293137
#> 6    C   D  0.4690460
#> 7    A   E -0.5755463
#> 8    B   E  0.4907660
#> 9    C   E  0.3150499
#> 10   D   E  0.9859711

CodePudding user response：

1) Add an index column 1, 2, 3, 1, 2, 3, ... and then use read.zoo to convert from long to wide. Take the correlation reshape back to long form using as.data.frame.table and filter out the desired rows.

library(dplyr)
library(zoo)

DF %>%
  mutate(index = sequence(rle(var)$lengths)) %>%
  read.zoo(index = "index", split = "var") %>%
  cor %>%
  as.data.frame.table(responseName = "cor") %>%
  filter(format(Var1) < format(Var2))

2) At the expense of one more line of code we can substitute pivot_wider for read.zoo.

library(dplyr)
library(tidyr)

DF %>%
  mutate(index = sequence(rle(var)$lengths)) %>%
  pivot_wider(index, names_from = "var", values_from = "y") %>%
  select(-index) %>%
  cor %>%
  as.data.frame.table(responseName = "cor") %>%
  filter(format(Var1) < format(Var2))

3) A base solution consists of using combn to get the pairs of var with the indicated function f.

co <- combn(unique(DF$var), 2)
f <- function(v) with(DF, data.frame(t(v), cor = cor(y[var==v[1]], y[var==v[2]])))
do.call("rbind", apply(co, 2, f))

Note

The input in reproducible form.

DF <-
structure(list(var = c("A", "A", "A", "B", "B", "B", "C", "C", 
"C", "D", "D", "D", "E", "E", "E"), y = c(-1.23, -0.983, 1.28, 
-0.268, -0.46, -1.23, 1.87, 0.416, -1.99, 0.289, 1.7, -0.455, 
-0.648, 0.376, -0.887)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))