R: Create numbering within each group-CodePudding

The data that I have:

x = tibble(
  study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
  ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012)
)

The goal is to create the 'number' variable which shows the same number for each unique ID in sequence starting from 1.

goal = tibble(
  study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
  ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012),
  number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)

And then if within each ID group, the studies are incomplete (e.g., for number = 2, the studies are only A and B, instead of A, B, C), then how to remove the obs associated with that ID (e.g., remove obs that have a number of '2')?

Thanks!

Updated follow-up question on part B:

Once we have the goal dataset, I would like to remove the obs grouped by ID, that meet the following requirements in terms of the study variable:

A and D are required, one of B and C is required (so either B or C), and sometimes each letter will appear more than once.

x = tibble(
  study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A", "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
  ID = c(001, 001, 001, 001, 005, 005, 007, 007, 007, 012, 012, 012, 012, 012, 013, 013, 013, 018, 018, 018),
  number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)

So in the goal dataset above, I would like to remove: (1) Obs #5 and 6 which share a group number of 2, because they don't have A, B or C, and D in the study variable. (2) Obs #18, 19, 20 which share a group number of 6, for the same reason as (1).

I would like to keep the rest of the obs because within each number group, they have A, B or C, and D. I cannot use filter(n() > 3) here, because that would delete obs with the number 5.

CodePudding user response：

We could use cur_group_id()

library(dplyr)

  x %>% 
    group_by(ID) %>% 
    mutate(number = cur_group_id())

  study    ID  number
   <chr> <dbl> <int>
 1 A         1     1
 2 B         1     1
 3 C         1     1
 4 A         5     2
 5 B         5     2
 6 A         7     3
 7 B         7     3
 8 C         7     3
 9 A        12     4
10 B        12     4

library(dplyr)
x %>% 
  mutate(number = cumsum(ID != lag(ID, default = first(ID))) 1)

   study    ID number
   <chr> <dbl>  <dbl>
 1 A         1      1
 2 B         1      1
 3 C         1      1
 4 A         5      2
 5 B         5      2
 6 A         7      3
 7 B         7      3
 8 C         7      3
 9 A        12      4
10 B        12      4

CodePudding user response：

A) The dplyr package offers group_indices() for adding unique group indentifiers:

library(dplyr)

df$number <- df %>% 
  group_indices(ID)
df

# A tibble: 10 × 3
   study    ID number
   <chr> <dbl>  <int>
 1 A         1      1
 2 B         1      1
 3 C         1      1
 4 A         5      2
 5 B         5      2
...

B) You can drop observations where the group size is less than 3 (i.e., "A", "B" and "C") with filter():

df %>% 
  group_by(ID) %>% 
  filter(n() == 3)

# A tibble: 6 × 3
# Groups:   ID [2]
  study    ID number
  <chr> <dbl>  <int>
1 A         1      1
2 B         1      1
3 C         1      1
4 A         7      3
5 B         7      3
6 C         7      3

CodePudding user response：

A and D are required, one of B and C is required (so either B or C)

df %>%
  group_by(ID) %>%
  mutate(
    flag = 
      (
        any(study %in% c("A")) & 
        any(study %in% c("D"))
      ) &
      (
        any(study %in% c("B")) |
        any(study %in% c("C"))
      )
  ) %>% 
  filter(flag)

# A tibble: 12 × 4
# Groups:   ID [3]
   study    ID number flag 
   <chr> <dbl>  <dbl> <lgl>
 1 A         1      1 TRUE 
 2 B         1      1 TRUE 
 3 C         1      1 TRUE 
 4 D         1      1 TRUE 
 5 A        12      4 TRUE 
 6 B        12      4 TRUE 
 7 C        12      4 TRUE 
 8 D        12      4 TRUE 
 9 D        12      4 TRUE 
10 A        13      5 TRUE 
11 B        13      5 TRUE 
12 D        13      5 TRUE