Let's say i have data frame in R that looks like this :
var = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3))
y = rnorm(15)
data = tibble(var,y);data
With output:
# A tibble: 15 x 2
var y
<chr> <dbl>
1 A -1.23
2 A -0.983
3 A 1.28
4 B -0.268
5 B -0.460
6 B -1.23
7 C 1.87
8 C 0.416
9 C -1.99
10 D 0.289
11 D 1.70
12 D -0.455
13 E -0.648
14 E 0.376
15 E -0.887
i want to calculate the correlation of each distinct pair in R using dplyr. Ideally i want to look like this (the third column to contain the values of each correlation pair):
var1 | var2 | value |
---|---|---|
A | B | cor(A,B) |
A | C | cor(A,C) |
A | D | cor(A,D) |
A | E | cor(A,E) |
B | C | cor(B,E) |
B | D | cor(B,E) |
B | E | cor(B,E) |
C | D | cor(C,E) |
C | E | cor(C,E) |
D | E | cor(D,E) |
How i can do that in R ? Any help ?
Additional
if i have another grouping variable say group2:
var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
rep("H",3),rep("I",3))
y2 = rnorm(24)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2);data2
which ideally must look like this :
group | var1 | var2 | value |
---|---|---|---|
1 | A | B | cor(A,B) |
1 | A | H | cor(A,H) |
1 | A | I | cor(A,I) |
1 | B | H | cor(B,H) |
1 | B | I | cor(B,I) |
1 | H | I | cor(H,I) |
2 | C | D | cor(C,D) |
3 | E | F | cor(E,F) |
How i can calculate each variable in column var2 on each group group2?
CodePudding user response:
Here is a one-liner via base R
data.frame(t(combn(unique(data$var), 2, function(i)
list(v1 = i[[1]],
v2 = i[[2]],
value = cor(data$y[data$var %in% i[[1]]],
data$y[data$var %in% i[[2]]])))))
X1 X2 X3
1 A B 0.997249
2 A C 0.7544987
3 A D -0.7924587
4 A E 0.03567887
5 B C 0.8010711
6 B D -0.7450683
7 B E 0.1096579
8 C D -0.1976141
9 C E 0.6828033
10 D E 0.5812632
CodePudding user response:
Another possible solution:
library(tidyverse)
df %>%
group_by(var) %>%
group_map(~ data.frame(.x) %>% set_names(.y)) %>%
bind_cols %>% cor %>%
{data.frame(row=rownames(.)[row(.)[upper.tri(.)]],
col=colnames(.)[col(.)[upper.tri(.)]],
corr=.[upper.tri(.)])}
#> row col corr
#> 1 A B -0.9949738
#> 2 A C -0.9574502
#> 3 B C 0.9815368
#> 4 A D -0.7039708
#> 5 B D 0.6293137
#> 6 C D 0.4690460
#> 7 A E -0.5755463
#> 8 B E 0.4907660
#> 9 C E 0.3150499
#> 10 D E 0.9859711
CodePudding user response:
1) Add an index column 1, 2, 3, 1, 2, 3, ... and then use read.zoo to convert from long to wide. Take the correlation reshape back to long form using as.data.frame.table and filter out the desired rows.
library(dplyr)
library(zoo)
DF %>%
mutate(index = sequence(rle(var)$lengths)) %>%
read.zoo(index = "index", split = "var") %>%
cor %>%
as.data.frame.table(responseName = "cor") %>%
filter(format(Var1) < format(Var2))
2) At the expense of one more line of code we can substitute pivot_wider for read.zoo.
library(dplyr)
library(tidyr)
DF %>%
mutate(index = sequence(rle(var)$lengths)) %>%
pivot_wider(index, names_from = "var", values_from = "y") %>%
select(-index) %>%
cor %>%
as.data.frame.table(responseName = "cor") %>%
filter(format(Var1) < format(Var2))
3) A base solution consists of using combn to get the pairs of var with the indicated function f.
co <- combn(unique(DF$var), 2)
f <- function(v) with(DF, data.frame(t(v), cor = cor(y[var==v[1]], y[var==v[2]])))
do.call("rbind", apply(co, 2, f))
Note
The input in reproducible form.
DF <-
structure(list(var = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "E", "E", "E"), y = c(-1.23, -0.983, 1.28,
-0.268, -0.46, -1.23, 1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
-0.648, 0.376, -0.887)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))