Home > Software engineering >  Find duplicate values that have same set of observations
Find duplicate values that have same set of observations

Time:09-20

I have data in below format:

Profile Group
A Grp1
A Grp2
A Grp3
B Grp7
C Grp32
B Grp32
C Grp7
D Grp16
H Grp19
M Grp1
M Grp2
M Grp3

I would like to find out the profiles (1st column) that have same set of Groups (2nd column).

In the sample data above Profiles "A" and "M" have same set of groups (Grp1, Grp2, Grp3) and profiles "B" and "C" have same set of groups (Grp7, Grp32). Hence, the resulting data should be:

Profile Group
A Grp1
A Grp2
A Grp3
M Grp1
M Grp2
M Grp3
B Grp7
B Grp32
C Grp7
C Grp32

Would appreciate any help.

CodePudding user response:

A dplyr solution:

library(dplyr)

df %>%
  group_by(Profile) %>%
  summarise(Group = toString(sort(Group))) %>%
  group_by(Group) %>%
  filter(n() > 1) %>%
  arrange(.by_group = TRUE) %>%
  tidyr::separate_rows(Group)

# # A tibble: 10 × 2
#    Profile Group
#    <chr>   <chr>
#  1 A       Grp1
#  2 A       Grp2
#  3 A       Grp3
#  4 M       Grp1
#  5 M       Grp2
#  6 M       Grp3
#  7 B       Grp32
#  8 B       Grp7
#  9 C       Grp32
# 10 C       Grp7
Data
df <- structure(list(Profile = c("A", "A", "A", "B", "C", "B", "C", 
"D", "H", "M", "M", "M"), Group = c("Grp1", "Grp2", "Grp3", "Grp7",
"Grp32", "Grp32", "Grp7", "Grp16", "Grp19", "Grp1", "Grp2", "Grp3"
)), class = "data.frame", row.names = c(NA, -12L))
  •  Tags:  
  • r
  • Related