How to append date/duration spells in R-CodePudding

I have data with start and end dates, and if they "chain" together then they should be part of the same event.

A <- data.frame(group = c("A", "A", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E", "E"),
           startdate = c("2019-01-01", "2019-01-03", "2019-01-25", "2019-01-19", "2019-01-20", "2019-01-22", "2019-01-09", "2019-01-11", "2019-01-27", "2019-01-02", "2019-01-03", "2019-01-09", "2019-01-12"),
           enddate = c("2019-01-03", "2019-01-05", "2019-01-26", "2019-01-20", "2019-01-22", "2019-01-23", "2019-01-11", "2019-01-12", "2019-02-01", "2019-01-03", "2019-01-05", "2019-01-12", "2019-01-13")) %>% 
  mutate(outcome = c(runif(nrow(.))))
A
  group  startdate    enddate     outcome
1      A 2019-01-01 2019-01-03 0.955011991
2      A 2019-01-03 2019-01-05 0.476095975
3      B 2019-01-25 2019-01-26 0.005301758
4      C 2019-01-19 2019-01-20 0.179261809
5      C 2019-01-20 2019-01-22 0.688228826
6      C 2019-01-22 2019-01-23 0.248906204
7      D 2019-01-09 2019-01-11 0.151737360
8      D 2019-01-11 2019-01-12 0.503649969
9      D 2019-01-27 2019-02-01 0.310691439
10     E 2019-01-02 2019-01-03 0.152001235
11     E 2019-01-03 2019-01-05 0.434751253
12     E 2019-01-09 2019-01-12 0.510411369
13     E 2019-01-12 2019-01-13 0.214607322

What I would like is another column that gives an eventID:

  group  startdate    enddate     outcome eventID
1      A 2019-01-01 2019-01-03 0.955011991       1
2      A 2019-01-03 2019-01-05 0.476095975       1
3      B 2019-01-25 2019-01-26 0.005301758       1
4      C 2019-01-19 2019-01-20 0.179261809       1
5      C 2019-01-20 2019-01-22 0.688228826       1
6      C 2019-01-22 2019-01-23 0.248906204       1
7      D 2019-01-09 2019-01-11 0.151737360       1
8      D 2019-01-11 2019-01-12 0.503649969       1
9      D 2019-01-27 2019-02-01 0.310691439       2
10     E 2019-01-02 2019-01-03 0.152001235       1
11     E 2019-01-03 2019-01-05 0.434751253       1
12     E 2019-01-09 2019-01-12 0.510411369       2
13     E 2019-01-12 2019-01-13 0.214607322       2

I can then make a new person-event ID and sum the outcome by that new ID.

The only solution I have right now cannot deal with two "chained" events for the same group (group E in the example)

library(tidyverse)
A %>% 
  group_by(group) %>% 
  mutate(datediff = as.Date(startdate) - lag(as.Date(enddate))) %>% 
  mutate(eventID = ifelse(is.na(datediff)|datediff == 0, 1,n())) %>% 
  mutate(newID = paste0(group, eventID)) %>% 
  group_by(newID) %>% 
  summarise(outcome = sum(outcome))

This is similar to this question, but the data are not a time series.

CodePudding user response：

We may use cumsum after creating the 'eventID'

library(dplyr)
A %>% 
  group_by(group) %>% 
  mutate(datediff = as.Date(startdate) - lag(as.Date(enddate))) %>%  
  mutate(eventID = ifelse(is.na(datediff)|datediff == 0, 1,n()), 
         eventID = 1   cumsum(eventID != 1), datediff = NULL) %>%
  ungroup

-output

# A tibble: 13 × 5
   group startdate  enddate    outcome eventID
   <chr> <chr>      <chr>        <dbl>   <dbl>
 1 A     2019-01-01 2019-01-03  0.0589       1
 2 A     2019-01-03 2019-01-05  0.460        1
 3 B     2019-01-25 2019-01-26  0.506        1
 4 C     2019-01-19 2019-01-20  0.157        1
 5 C     2019-01-20 2019-01-22  0.303        1
 6 C     2019-01-22 2019-01-23  0.200        1
 7 D     2019-01-09 2019-01-11  0.117        1
 8 D     2019-01-11 2019-01-12  0.851        1
 9 D     2019-01-27 2019-02-01  0.340        2
10 E     2019-01-02 2019-01-03  0.0618       1
11 E     2019-01-03 2019-01-05  0.645        1
12 E     2019-01-09 2019-01-12  0.764        2
13 E     2019-01-12 2019-01-13  0.537        2