Similar to Extract rows for the first occurrence of a variable in a data frame Dataset example:
ID grade date
A 1 1/1/2001
A 3 1/2/2002
A 5 2/4/2002
B 4 1/1/2008
B 4 1/1/2010
B 5 1/1/2011
B 5 1/3/2011
B 4 1/5/2011
C 2 1/1/2006
C 5 1/1/2007
C 5 1/1/2008
D 3 1/1/1996
D 5 1/1/1997
D 5 1/1/1999
E 1 1/1/2003
E 3 1/1/2005
E 3 1/1/2007
I would like to keep all entries per person until they hit grade 5 and then stop irrespective of grades after the earliest mention of 5. Not all IDs have reached stage 5, they should remain untouched even if they have multiple of the same grade
Desired output:
ID grade date
A 1 1/1/2001
A 3 1/2/2002
A 5 2/4/2002
B 4 1/1/2008
B 4 1/1/2010
B 5 1/1/2011
C 2 1/1/2006
C 5 1/1/2007
D 3 1/1/1996
D 5 1/1/1997
E 1 1/1/2003
E 3 1/1/2005
E 3 1/1/2007
I can used order and in dplyr the filter function but I am unsure how to apply these to just the 5's per person and then exclude anything afterwards?
Many thanks
CodePudding user response:
You can use slice
to grab the sequence from 1 until the first appearance of 5, i.e.
library(dplyr)
df %>%
group_by(ID) %>%
slice(1L : first(which(grade == 5)[1]))
# A tibble: 10 × 3
# Groups: ID [4]
ID grade date
<chr> <int> <chr>
1 A 1 1/1/2001
2 A 3 1/2/2002
3 A 5 2/4/2002
4 B 4 1/1/2008
5 B 4 1/1/2010
6 B 5 1/1/2011
7 C 2 1/1/2006
8 C 5 1/1/2007
9 D 3 1/1/1996
10 D 5 1/1/1997
EDIT If you have groups that do NOT have grade = 5, then
df %>%
group_by(ID) %>%
mutate(new = which(grade == 5)[1],
new = replace(new, is.na(new), max(row_number()))) %>%
slice(1L : first(new)) %>%
select(-new)
# A tibble: 13 × 4
# Groups: ID [5]
ID grade date
<chr> <int> <chr>
1 A 1 1/1/2001
2 A 3 1/2/2002
3 A 5 2/4/2002
4 B 4 1/1/2008
5 B 4 1/1/2010
6 B 5 1/1/2011
7 C 2 1/1/2006
8 C 5 1/1/2007
9 D 3 1/1/1996
10 D 5 1/1/1997
11 E 1 1/1/2003
12 E 3 1/1/2005
13 E 3 1/1/2007