Home > Enterprise >  Splitting a grouping column using another column and map it using an increasing sequence
Splitting a grouping column using another column and map it using an increasing sequence

Time:08-25

I have data that is uniquely identified with an id, and grouped by a sub_id and group_id. Each instance with a unique object_id generates an id. Similar instances concerning different objects are grouped by a sub_id. Multiple interactions for one instance with one or more objects are grouped by the group_id column. Essentially, if someone interacts twice for two objects, there will be 4 unique ids, two unique sub_ids and a single group_id. The only way to distinguish the difference in the duplicate interactions is through the object_id column:

library(tidyverse)


tibble(id = c(1:11),
       sub_id = c(1,2,1,2,3,4,3,4,5,6,6),
       group_id = c(rep(1,4),rep(3,4),5,6,6),
       object_id = c(1,1,2,2,2,2,3,3,2,2,3))


# A tibble: 11 × 4
      id sub_id group_id object_id
   <int>  <dbl>    <dbl>     <dbl>
 1     1      1        1         1
 2     2      2        1         1
 3     3      1        1         2
 4     4      2        1         2
 5     5      3        3         2
 6     6      4        3         2
 7     7      3        3         3
 8     8      4        3         3
 9     9      5        5         2
10    10      6        6         2
11    11      6        6         3

I would like some sort of sub_group_id that will split the group_id interactions in sequences using the object_id column. The expected output would look like this:

# A tibble: 11 × 5
      id sub_id group_id subgroup_id object_id
   <int>  <dbl>    <dbl>       <dbl>     <dbl>
 1     1      1        1           1         1
 2     2      2        1           1         1
 3     3      1        1           2         2
 4     4      2        1           2         2
 5     5      3        3           3         2
 6     6      4        3           3         2
 7     7      3        3           4         3
 8     8      4        3           4         3
 9     9      5        5           5         2
10    10      6        6           6         2
11    11      6        6           7         3

That column essentially needs to increase infinitly to distinguish between different subgroups. It ressembles the object_id column, but the same object can be present in different sub_ids, group_ids, and subgroup_ids, whereas each subgroup_id only appears in one group_id. Any ideas on how to achieve that output?

CodePudding user response:

group_by group_id and object_id, and then use cur_group_id to get the sequence of id for each subgroup:

dat %>% 
  group_by(group_id, object_id) %>% 
  mutate(subgroup_id = cur_group_id())

      id sub_id group_id object_id subgroup_id
   <int>  <dbl>    <dbl>     <dbl>       <int>
 1     1      1        1         1           1
 2     2      2        1         1           1
 3     3      1        1         2           2
 4     4      2        1         2           2
 5     5      3        3         2           3
 6     6      4        3         2           3
 7     7      3        3         3           4
 8     8      4        3         3           4
 9     9      5        5         2           5
10    10      6        6         2           6
11    11      6        6         3           7

CodePudding user response:

We could use

library(dplyr)
library(data.table)
 df1  %>% 
   mutate(subgroup_id = rleid(object_id, group_id))
  • Related