I have a data.frame that looks somewhat like this:
df <- data.frame(names = LETTERS[1:10],
                 rep1 = sample(1:5, 10, replace = TRUE),
                 rep2 = sample(1:5, 10, replace = TRUE),
                 rep3 = sample(1:5, 10, replace = TRUE),
                 rep4 = sample(1:5, 10, replace = TRUE))
print(df)
   names rep1 rep2 rep3 rep4
1      A    2    2    5    4
2      B    5    5    5    1
3      C    3    4    2    5
4      D    5    3    5    3
5      E    2    3    2    4
6      F    5    5    2    4
7      G    1    3    1    3
8      H    2    2    3    3
9      I    1    1    4    3
10     J    3    1    3    5
What I need to know: are some of the names ('samples') grouped together (by number) across the different reps?
The actual group numbers (1 to 5) do not matter, only whether specific names fall into the same group. For example, A, E and H belong to group 2 in rep1; are they grouped together in another rep as well? I want to know whether there is a 'pattern' of groupings, i.e. whether some names occur together (as a set) more often than others.
Does anyone have an idea how to achieve this?
CodePudding user response:
Perhaps this helps you find a pattern:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-names) %>%
  group_by(name, value) %>%
  summarise(grouping = paste(names, collapse = ", "),
            .groups = "drop") %>%
  pivot_wider(names_from = name,
              values_from = grouping)
This returns
# A tibble: 5 x 5
  value rep1    rep2    rep3       rep4
  <int> <chr>   <chr>   <chr>      <chr>
1     1 D, E, J NA      I          A, C, E
2     2 A, B    F, H    A, C, D, F G
3     4 F, H    D, E    H          D, H, I
4     5 C, G, I A, I, J B, J       B, F
5     3 NA      B, C, G E, G       J
where value is the original group number from the reps.
Data
structure(list(names = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), rep1 = c(2L, 2L, 5L, 1L, 1L, 4L, 5L, 4L, 5L, 1L),
rep2 = c(5L, 3L, 3L, 4L, 4L, 2L, 3L, 2L, 5L, 5L), rep3 = c(2L,
5L, 2L, 2L, 3L, 2L, 3L, 4L, 1L, 5L), rep4 = c(1L, 5L, 1L,
4L, 1L, 5L, 2L, 4L, 4L, 3L)), class = "data.frame", row.names = c(NA,
-10L))
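To count how often two names end up in the same group, regardless of the group number, one option is a pairwise co-occurrence matrix. The following is only a minimal base R sketch using the data above; the object names rep_cols, co_occurrence and pairs are made up for illustration and are not part of the original answer.

```r
# Same data as the dput() output above, written out as a data.frame.
df <- data.frame(
  names = LETTERS[1:10],
  rep1 = c(2L, 2L, 5L, 1L, 1L, 4L, 5L, 4L, 5L, 1L),
  rep2 = c(5L, 3L, 3L, 4L, 4L, 2L, 3L, 2L, 5L, 5L),
  rep3 = c(2L, 5L, 2L, 2L, 3L, 2L, 3L, 4L, 1L, 5L),
  rep4 = c(1L, 5L, 1L, 4L, 1L, 5L, 2L, 4L, 4L, 3L)
)

# Hypothetical helper name: all columns except the name column are reps.
rep_cols <- setdiff(names(df), "names")

# For each rep, outer() yields a matrix that is TRUE where two names share
# a group. Summing over reps counts, per pair, the reps in which they co-occur.
co_occurrence <- Reduce(`+`, lapply(rep_cols, function(col) {
  outer(df[[col]], df[[col]], "==") * 1L
}))
dimnames(co_occurrence) <- list(df$names, df$names)

# Keep each unordered pair once and list the most frequent pairs first.
idx <- which(upper.tri(co_occurrence), arr.ind = TRUE)
pairs <- data.frame(
  name1 = df$names[idx[, "row"]],
  name2 = df$names[idx[, "col"]],
  n_reps_together = co_occurrence[idx]
)
pairs <- pairs[order(-pairs$n_reps_together), ]
head(pairs)
```

With this data, four pairs (A and C, C and G, D and E, F and H) share a group in two of the four reps, and no pair is grouped together more often than that.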
CodePudding user response:
Here is a solution returning the maximum overlap per rep, i.e. the largest group(s) of names in each rep.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-names, names_to = "rep") %>%
  group_by(rep, value) %>%
  summarise(n = n(),
            names = paste(names, collapse = ", ")) %>%
  filter(n == max(n))
#`summarise()` has grouped output by 'rep'. You can override using the `.groups` argument.
## A tibble: 7 x 4
## Groups:   rep [4]
#  rep   value     n names
#  <chr> <int> <int> <chr>
#1 rep1      4     4 B, C, G, I
#2 rep2      3     3 A, D, I
#3 rep2      4     3 B, F, J
#4 rep3      2     3 D, G, H
#5 rep3      3     3 E, F, J
#6 rep3      5     3 A, B, I
#7 rep4      1     3 B, C, J
Data
The test data creation code is repeated from the question but with the pseudo-RNG seed set, in order to make the results reproducible.
set.seed(2021)
df <- data.frame(names = LETTERS[1:10],
                 rep1 = sample(1:5, 10, replace = TRUE),
                 rep2 = sample(1:5, 10, replace = TRUE),
                 rep3 = sample(1:5, 10, replace = TRUE),
                 rep4 = sample(1:5, 10, replace = TRUE))
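The pipeline above reports the largest group in each rep, but not whether the same set of names shows up in more than one rep. A possible follow-up, sketched here in base R under the assumption that only exact matches of a whole group count as a recurring set, is to collapse every group into a string and tabulate those strings across reps. The names rep_cols, sets and set_counts are hypothetical and not part of the answer.

```r
# Test data as above; results depend on the seed and on the R version's sampler.
set.seed(2021)
df <- data.frame(names = LETTERS[1:10],
                 rep1 = sample(1:5, 10, replace = TRUE),
                 rep2 = sample(1:5, 10, replace = TRUE),
                 rep3 = sample(1:5, 10, replace = TRUE),
                 rep4 = sample(1:5, 10, replace = TRUE))

# Hypothetical helper name: all columns except the name column are reps.
rep_cols <- setdiff(names(df), "names")

# One string per group and rep: the names in that group, comma-separated.
sets <- unlist(lapply(rep_cols, function(col) {
  vapply(split(df$names, df[[col]]), paste, character(1), collapse = ", ")
}), use.names = FALSE)

# Within a rep every set is unique, so each count is the number of reps
# in which exactly that set of names forms a group.
set_counts <- sort(table(sets), decreasing = TRUE)

# Sets that occur in more than one rep (may be empty for random data).
set_counts[set_counts > 1]
```

This only detects identical sets; names that merely overlap in part would need a pairwise count instead.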