How to aggregate data based on dates in R?-CodePudding

I have a database that for the same event has multiple different sub-events that I would like to aggregate into a single event.

I would like to aggregate it only when the start date of the next record is the next day following the end date of the first record. So if a start date is 22/05/2015 and the end date for record 1 is 31/05/2015, and the start date of record 2 is 01/06/2015 with 15/06/2015, then aggregate the rows into 1 row so that, record 1 shows 22/05/2015 as start date and the end date is 15/06/2015.

For example it looks like;

Name DOB Start Date End Date
John Doe 1/01/2000 22/05/2015 31/05/2015

John Doe 1/01/2000 1/06/2015 15/06/2015

John Doe 1/01/2000 16/06/2015 20/06/2015

John Doe 1/01/2000 7/07/2015 8/07/2015

Jane Doe 1/01/1985 20/06/2018 21/06/2018

Jane Doe 1/01/1985 22/06/2018 30/06/2018

Jane Doe 1/01/1985 1/07/2018 2/07/2018

Jane Doe 1/01/1985 30/07/2018 31/07/2018

I would to aggregate it to the following;

Name DOB Start Date End Date

John Doe 1/01/2000 22/05/2015 20/06/2015

John Doe 1/01/2000 7/07/2015 8/07/2015

Jane Doe 1/01/1985 20/06/2018 2/07/2018

Jane Doe 1/01/1985 30/07/2018 31/07/2018

I have used the following code but it does not work very well.

ddply(df,~name dob,summarise, actualstart=min(start.date), actualend=max(end.date))

The issue is that it also aggregates the records that do not directly follow.

Please help, thank you.

CodePudding user response：

Here is one option using dplyr.

Take the difference between current Start_date and the previous End_date if the difference is greater than 1 day then merge the dates.

library(dplyr)

df %>%
  mutate(across(-Name, lubridate::dmy)) %>%
  group_by(Name) %>%
  group_by(grp = cumsum(Start_Date - lag(End_Date, default = first(Start_Date)) > 1), .add = TRUE) %>%
  summarise(DOB = first(DOB), 
            Start_Date = min(Start_Date), 
            End_Date = max(End_Date), .groups = 'drop') %>%
  select(-grp)

#   Name    DOB        Start_Date End_Date  
#  <chr>   <date>     <date>     <date>    
#1 JaneDoe 1985-01-01 2018-06-20 2018-07-02
#2 JaneDoe 1985-01-01 2018-07-30 2018-07-31
#3 JohnDoe 2000-01-01 2015-05-22 2015-06-20
#4 JohnDoe 2000-01-01 2015-07-07 2015-07-08

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(Name = c("JohnDoe", "JohnDoe", "JohnDoe", "JohnDoe", 
"JaneDoe", "JaneDoe", "JaneDoe", "JaneDoe"), DOB = c("1/01/2000", 
"1/01/2000", "1/01/2000", "1/01/2000", "1/01/1985", "1/01/1985", 
"1/01/1985", "1/01/1985"), Start_Date = c("22/05/2015", "1/06/2015", 
"16/06/2015", "7/07/2015", "20/06/2018", "22/06/2018", "1/07/2018", 
"30/07/2018"), End_Date = c("31/05/2015", "15/06/2015", "20/06/2015", 
"8/07/2015", "21/06/2018", "30/06/2018", "2/07/2018", "31/07/2018"
)), class = "data.frame", row.names = c(NA, -8L))