I have data with start and end dates, and if they "chain" together then they should be part of the same event.
A <- data.frame(group = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E", "E"),
startdate = c("2019-01-01", "2019-01-03", "2019-01-25", "2019-01-19", "2019-01-20", "2019-01-22", "2019-01-09", "2019-01-11", "2019-01-27", "2019-01-02", "2019-01-03", "2019-01-09", "2019-01-12"),
enddate = c("2019-01-03", "2019-01-05", "2019-01-26", "2019-01-20", "2019-01-22", "2019-01-23", "2019-01-11", "2019-01-12", "2019-02-01", "2019-01-03", "2019-01-05", "2019-01-12", "2019-01-13")) %>%
mutate(outcome = c(runif(nrow(.))))
A
group startdate enddate outcome
1 A 2019-01-01 2019-01-03 0.955011991
2 A 2019-01-03 2019-01-05 0.476095975
3 B 2019-01-25 2019-01-26 0.005301758
4 C 2019-01-19 2019-01-20 0.179261809
5 C 2019-01-20 2019-01-22 0.688228826
6 C 2019-01-22 2019-01-23 0.248906204
7 D 2019-01-09 2019-01-11 0.151737360
8 D 2019-01-11 2019-01-12 0.503649969
9 D 2019-01-27 2019-02-01 0.310691439
10 E 2019-01-02 2019-01-03 0.152001235
11 E 2019-01-03 2019-01-05 0.434751253
12 E 2019-01-09 2019-01-12 0.510411369
13 E 2019-01-12 2019-01-13 0.214607322
What I would like is another column that gives an eventID:
group startdate enddate outcome eventID
1 A 2019-01-01 2019-01-03 0.955011991 1
2 A 2019-01-03 2019-01-05 0.476095975 1
3 B 2019-01-25 2019-01-26 0.005301758 1
4 C 2019-01-19 2019-01-20 0.179261809 1
5 C 2019-01-20 2019-01-22 0.688228826 1
6 C 2019-01-22 2019-01-23 0.248906204 1
7 D 2019-01-09 2019-01-11 0.151737360 1
8 D 2019-01-11 2019-01-12 0.503649969 1
9 D 2019-01-27 2019-02-01 0.310691439 2
10 E 2019-01-02 2019-01-03 0.152001235 1
11 E 2019-01-03 2019-01-05 0.434751253 1
12 E 2019-01-09 2019-01-12 0.510411369 2
13 E 2019-01-12 2019-01-13 0.214607322 2
I can then make a new person-event ID and sum the outcome by that new ID.
The only solution I have right now cannot deal with two "chained" events for the same group (group E in the example)
library(tidyverse)
A %>%
group_by(group) %>%
mutate(datediff = as.Date(startdate) - lag(as.Date(enddate))) %>%
mutate(eventID = ifelse(is.na(datediff)|datediff == 0, 1,n())) %>%
mutate(newID = paste0(group, eventID)) %>%
group_by(newID) %>%
summarise(outcome = sum(outcome))
This is similar to this question, but the data are not a time series.
CodePudding user response:
We may use cumsum
after creating the 'eventID'
library(dplyr)
A %>%
group_by(group) %>%
mutate(datediff = as.Date(startdate) - lag(as.Date(enddate))) %>%
mutate(eventID = ifelse(is.na(datediff)|datediff == 0, 1,n()),
eventID = 1 cumsum(eventID != 1), datediff = NULL) %>%
ungroup
-output
# A tibble: 13 × 5
group startdate enddate outcome eventID
<chr> <chr> <chr> <dbl> <dbl>
1 A 2019-01-01 2019-01-03 0.0589 1
2 A 2019-01-03 2019-01-05 0.460 1
3 B 2019-01-25 2019-01-26 0.506 1
4 C 2019-01-19 2019-01-20 0.157 1
5 C 2019-01-20 2019-01-22 0.303 1
6 C 2019-01-22 2019-01-23 0.200 1
7 D 2019-01-09 2019-01-11 0.117 1
8 D 2019-01-11 2019-01-12 0.851 1
9 D 2019-01-27 2019-02-01 0.340 2
10 E 2019-01-02 2019-01-03 0.0618 1
11 E 2019-01-03 2019-01-05 0.645 1
12 E 2019-01-09 2019-01-12 0.764 2
13 E 2019-01-12 2019-01-13 0.537 2