I have data that has multiple observations per id and is arranged in order of date. I would like to be able to delete all of the rows that occur after the first occurrence of a factor within each id. The factor, called 'ind' in the example below, has multiple levels. Most rows will be level "a". In the example below, I would like to keep all rows until a "b" or a "c" occurs, and delete all rows that follow the first occurrence of one or both of those values. In some situations, a "b" and "c" occur within the same id and on the exact same date and time (as in id=3 in my example below). In this situation, I would like to keep the rows that have "b" and "c", instead of simply deleting any rows that follow the whichever value comes first. Sometimes an id has no "b" or "c" values, in which case I want to keep all rows for that id. Sometimes an id has only a "b" and/or "c" value and no "a" values, in which case I would also keep all rows as long as they had the same date and time if there were both "b" and "c" values. Here is code for an example dataset:
id <- c(1,1,1,2,2,3,3,3,3,3,4,5,6,6)
ind <- c("a","b","a","a","c","a","a","b","c","a","a","b","a","a")
dt <- c("2017-06-30 00:00:00", "2017-07-1 00:00:00", "2017-07-04 00:00:00",
"2017-05-23 00:00:00", "2017-05-30 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00", "2017-08-15 00:00:00",
"2017-06-05 00:00:00",
"2017-06-21 00:00:00",
"2017-01-20 00:00:00", "2017-01-21 00:00:00")
dat <- cbind(id,ind,dt)
dat <- data.frame(dat)
dat$id <- factor(dat$id)
dat$ind <- factor(dat$ind)
dat$dt <- as.POSIXct(dat$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat
I would like to obtain a dataset that looks like this (the last row of id=1 and the last row of id=3 was deleted):
id <- c(1,1,2,2,3,3,3,3,4,5,6,6)
ind <- c("a","b","a","c","a","a","b","c","a","b","a","a")
dt <- c("2017-06-30 00:00:00", "2017-07-1 00:00:00",
"2017-05-23 00:00:00", "2017-05-30 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
"2017-06-05 00:00:00",
"2017-06-21 00:00:00",
"2017-01-20 00:00:00", "2017-01-21 00:00:00")
dat2 <- cbind(id,ind,dt)
dat2 <- data.frame(dat2)
dat2$id <- factor(dat2$id)
dat2$ind <- factor(dat2$ind)
dat2$dt <- as.POSIXct(dat2$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat2
If there was only two levels in the factor, or if id's didn't sometimes have more than one factor level for the same date/time, then I could simply do something like this:
dat %>% group_by(id) %>% arrange(dt, .by_group = TRUE) %>% filter(cumsum(cumsum(ind == "b")) <= 1)
as suggested in this question. However, this won't work in my case, because in the example above for id=3, it would only take one of the non-"a" value rows.
Thanks for the help!
CodePudding user response:
We arrange
the data by 'id', 'dt', grouped by 'id', get the max
of the index of match
of a vector (c("b", "c")
) with the ind
column, replace the 0 with NA
(na_if
), do a coalesce
with the number of rows (n()
), create a logical vector with row_number()
to filter
the rows
library(dplyr)
dat %>%
arrange(id, dt) %>%
group_by(id) %>%
filter(row_number() <= coalesce(na_if(max(match(c("b", "c"),
ind, nomatch = 0)), 0), n())) %>%
ungroup
-output
# A tibble: 12 × 3
id ind dt
<fct> <fct> <dttm>
1 1 a 2017-06-30 00:00:00
2 1 b 2017-07-01 00:00:00
3 2 a 2017-05-23 00:00:00
4 2 c 2017-05-30 00:00:00
5 3 a 2017-08-04 00:00:00
6 3 a 2017-08-07 00:00:00
7 3 b 2017-08-10 00:00:00
8 3 c 2017-08-10 00:00:00
9 4 a 2017-06-05 00:00:00
10 5 b 2017-06-21 00:00:00
11 6 a 2017-01-20 00:00:00
12 6 a 2017-01-21 00:00:00