Find duplicate values that have same set of observations-CodePudding

I have data in below format:

Profile	Group
A	Grp1
A	Grp2
A	Grp3
B	Grp7
C	Grp32
B	Grp32
C	Grp7
D	Grp16
H	Grp19
M	Grp1
M	Grp2
M	Grp3

I would like to find out the profiles (1st column) that have same set of Groups (2nd column).

In the sample data above Profiles "A" and "M" have same set of groups (Grp1, Grp2, Grp3) and profiles "B" and "C" have same set of groups (Grp7, Grp32). Hence, the resulting data should be:

Profile	Group
A	Grp1
A	Grp2
A	Grp3
M	Grp1
M	Grp2
M	Grp3
B	Grp7
B	Grp32
C	Grp7
C	Grp32

Would appreciate any help.

CodePudding user response：

A dplyr solution:

library(dplyr)

df %>%
  group_by(Profile) %>%
  summarise(Group = toString(sort(Group))) %>%
  group_by(Group) %>%
  filter(n() > 1) %>%
  arrange(.by_group = TRUE) %>%
  tidyr::separate_rows(Group)

# # A tibble: 10 × 2
#    Profile Group
#    <chr>   <chr>
#  1 A       Grp1
#  2 A       Grp2
#  3 A       Grp3
#  4 M       Grp1
#  5 M       Grp2
#  6 M       Grp3
#  7 B       Grp32
#  8 B       Grp7
#  9 C       Grp32
# 10 C       Grp7

Data

df <- structure(list(Profile = c("A", "A", "A", "B", "C", "B", "C", 
"D", "H", "M", "M", "M"), Group = c("Grp1", "Grp2", "Grp3", "Grp7",
"Grp32", "Grp32", "Grp7", "Grp16", "Grp19", "Grp1", "Grp2", "Grp3"
)), class = "data.frame", row.names = c(NA, -12L))