If I have a data frame like the following
group1 | group2 | col1 | col2 |
---|---|---|---|
A | 1 | ABC | 5 |
A | 1 | DEF | 2 |
B | 1 | AB | 1 |
C | 1 | ABC | 5 |
C | 1 | DEF | 2 |
A | 2 | BC | 8 |
B | 2 | AB | 1 |
We can see that the the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within this group). The same is true for (B,1) and (B, 2).
So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:
category | group1 | group2 |
---|---|---|
1 | A | 1 |
1 | C | 1 |
2 | B | 1 |
2 | B | 2 |
3 | A | 2 |
And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.
I have tried a few very long/inefficient ways of doing this in Dplyr but I'm sure there must be a more efficient way to do this. Thanks
CodePudding user response:
You can use pivot_wider
first to handle identical groups over multiple rows.
library(tidyverse)
df %>%
group_by(group1, group2) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = c(col1, col2)) %>%
group_by(across(-c(group1, group2))) %>%
mutate(category = cur_group_id()) %>%
ungroup() %>%
select(category, group1, group2) %>%
arrange(category)
category group1 group2
<int> <chr> <int>
1 1 B 1
2 1 B 2
3 2 A 1
4 2 C 1
5 3 A 2
CodePudding user response:
You could first group_by
"col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id
like this:
library(dplyr)
library(tidyr)
df %>%
group_by(col1, col2) %>%
filter(n() != 1) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-starts_with("col"))
#> # A tibble: 6 × 3
#> group1 group2 ID
#> <chr> <int> <int>
#> 1 A 1 2
#> 2 A 1 3
#> 3 B 1 1
#> 4 C 1 2
#> 5 C 1 3
#> 6 B 2 1
Created on 2022-08-12 by the reprex package (v2.0.1)