Home > Mobile >  Take the earliest mention of a figure from within a range by ID
Take the earliest mention of a figure from within a range by ID

Time:12-06

Similar to Extract rows for the first occurrence of a variable in a data frame Dataset example:

ID   grade    date 
A     1        1/1/2001
A     3        1/2/2002 
A     5        2/4/2002
B     4        1/1/2008
B     4        1/1/2010
B     5        1/1/2011
B     5        1/3/2011
B     4        1/5/2011
C     2        1/1/2006
C     5        1/1/2007
C     5        1/1/2008
D     3        1/1/1996
D     5        1/1/1997
D     5        1/1/1999
E     1        1/1/2003
E     3        1/1/2005
E     3        1/1/2007

I would like to keep all entries per person until they hit grade 5 and then stop irrespective of grades after the earliest mention of 5. Not all IDs have reached stage 5, they should remain untouched even if they have multiple of the same grade

Desired output:

ID   grade    date 
A     1        1/1/2001
A     3        1/2/2002 
A     5        2/4/2002
B     4        1/1/2008
B     4        1/1/2010
B     5        1/1/2011
C     2        1/1/2006
C     5        1/1/2007
D     3        1/1/1996
D     5        1/1/1997
E     1        1/1/2003
E     3        1/1/2005
E     3        1/1/2007

I can used order and in dplyr the filter function but I am unsure how to apply these to just the 5's per person and then exclude anything afterwards?

Many thanks

CodePudding user response:

You can use slice to grab the sequence from 1 until the first appearance of 5, i.e.

library(dplyr)

df %>% 
 group_by(ID) %>% 
 slice(1L : first(which(grade == 5)[1])) 

# A tibble: 10 × 3
# Groups:   ID [4]
   ID    grade date    
   <chr> <int> <chr>   
 1 A         1 1/1/2001
 2 A         3 1/2/2002
 3 A         5 2/4/2002
 4 B         4 1/1/2008
 5 B         4 1/1/2010
 6 B         5 1/1/2011
 7 C         2 1/1/2006
 8 C         5 1/1/2007
 9 D         3 1/1/1996
10 D         5 1/1/1997

EDIT If you have groups that do NOT have grade = 5, then

df %>% 
 group_by(ID) %>% 
 mutate(new = which(grade == 5)[1], 
        new = replace(new, is.na(new), max(row_number()))) %>% 
 slice(1L : first(new)) %>%
 select(-new)

# A tibble: 13 × 4
# Groups:   ID [5]
   ID    grade date       
   <chr> <int> <chr>    
 1 A         1 1/1/2001     
 2 A         3 1/2/2002     
 3 A         5 2/4/2002     
 4 B         4 1/1/2008     
 5 B         4 1/1/2010     
 6 B         5 1/1/2011     
 7 C         2 1/1/2006     
 8 C         5 1/1/2007     
 9 D         3 1/1/1996     
10 D         5 1/1/1997     
11 E         1 1/1/2003     
12 E         3 1/1/2005     
13 E         3 1/1/2007     
  • Related