I have a data frame of an ordered record set of events relating to a group with a unique id (grpid), where the record is the event (event) and one group id may relate to one or many events, but all the events have rank order variable (ordid). The task I am trying to achieve is to return the earliest event in the order of events in each group that meets a set of criteria. In the event that no records meet the criteria, simply return the earliest. I need them returning into a new data set. So in the minimally reproduceable example in data.frame ‘d’, my criterion is to return the first event that is not begin with ‘x’ or ‘y’ in each group, but if all the events begin with x or y then to accept the first occurring record even if it is an x/y one. So in group ‘a’ the first event, d21 meets my criterion so my new set will include ‘a’ and ‘d21’. In the second group there are four records. The first two are rejected but ‘b’ ‘j10’ is selected as it is the lowest ranking record not beginning with 'x' or 'y'. In the last case, all the records begin with an ‘x’ or a ‘y’, so I simply want the lowest ranking one which is ‘x10’ The real data set are some hundreds of millions so efficiency is a consideration. Code for an MRE is provided below.
#make data.frame of sample data
grpid<-c('a','a','a','b','b','b','b','c','c')
ordid<-c(1,2,3,1,2,3,4,1,2)
event<-c('d21','e30','a10','x89','y77','j10','d17','x10','x22')
d<-data.frame(grpid,ordid,event)
d
#make data.frame of desired output
grpid1<-c('a','b','c')
event1<-c('d21','j10','x10')
out<-data.frame(grpid1,event1)
out
CodePudding user response:
Grouped by 'grpid', create an index where the first character of 'event' is not 'x' or 'y', then slice
the first element based on the index and if
all
the values are either 'x' or 'y', then return the 1st observation
library(dplyr)
d %>%
group_by(grpid) %>%
mutate(ind = event == event[!substr(event, 1, 1) %in% c('x', 'y')][1]) %>%
slice(if(all(is.na(ind))) 1 else which(ind)[1]) %>%
ungroup %>%
select(-ordid, -ind)
-output
# A tibble: 3 × 2
grpid event
<chr> <chr>
1 a d21
2 b j10
3 c x10
Or use summarise
with coalesce
d %>%
group_by(grpid) %>%
summarise(event = coalesce(event[!substr(event, 1, 1) %in%
c('x', 'y')][1], first(event)))
# A tibble: 3 × 2
grpid event
<chr> <chr>
1 a d21
2 b j10
3 c x10