Home > other >  Which groups have exactly the same rows
Which groups have exactly the same rows

Time:08-12

If I have a data frame like the following

group1 group2 col1 col2
A 1 ABC 5
A 1 DEF 2
B 1 AB 1
C 1 ABC 5
C 1 DEF 2
A 2 BC 8
B 2 AB 1

We can see that the the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within this group). The same is true for (B,1) and (B, 2).

So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:

category group1 group2
1 A 1
1 C 1
2 B 1
2 B 2
3 A 2

And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.

I have tried a few very long/inefficient ways of doing this in Dplyr but I'm sure there must be a more efficient way to do this. Thanks

CodePudding user response:

You can use pivot_wider first to handle identical groups over multiple rows.

library(tidyverse)

df %>% 
  group_by(group1, group2) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = id, values_from = c(col1, col2)) %>% 
  group_by(across(-c(group1, group2))) %>% 
  mutate(category = cur_group_id()) %>% 
  ungroup() %>% 
  select(category, group1, group2) %>% 
  arrange(category)
  category group1 group2
     <int> <chr>   <int>
1        1 B           1
2        1 B           2
3        2 A           1
4        2 C           1
5        3 A           2

CodePudding user response:

You could first group_by "col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id like this:

library(dplyr)
library(tidyr)
df %>%
  group_by(col1, col2) %>%
  filter(n() != 1) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup() %>%
  select(-starts_with("col"))
#> # A tibble: 6 × 3
#>   group1 group2    ID
#>   <chr>   <int> <int>
#> 1 A           1     2
#> 2 A           1     3
#> 3 B           1     1
#> 4 C           1     2
#> 5 C           1     3
#> 6 B           2     1

Created on 2022-08-12 by the reprex package (v2.0.1)

  • Related