Home > Software engineering >  How to aggregate data based on dates in R?
How to aggregate data based on dates in R?

Time:09-24

I have a database that for the same event has multiple different sub-events that I would like to aggregate into a single event.

I would like to aggregate it only when the start date of the next record is the next day following the end date of the first record. So if a start date is 22/05/2015 and the end date for record 1 is 31/05/2015, and the start date of record 2 is 01/06/2015 with 15/06/2015, then aggregate the rows into 1 row so that, record 1 shows 22/05/2015 as start date and the end date is 15/06/2015.

For example it looks like;

Name DOB Start Date End Date
John Doe 1/01/2000 22/05/2015 31/05/2015

John Doe 1/01/2000 1/06/2015 15/06/2015

John Doe 1/01/2000 16/06/2015 20/06/2015

John Doe 1/01/2000 7/07/2015 8/07/2015

Jane Doe 1/01/1985 20/06/2018 21/06/2018

Jane Doe 1/01/1985 22/06/2018 30/06/2018

Jane Doe 1/01/1985 1/07/2018 2/07/2018

Jane Doe 1/01/1985 30/07/2018 31/07/2018

I would to aggregate it to the following;

Name DOB Start Date End Date

John Doe 1/01/2000 22/05/2015 20/06/2015

John Doe 1/01/2000 7/07/2015 8/07/2015

Jane Doe 1/01/1985 20/06/2018 2/07/2018

Jane Doe 1/01/1985 30/07/2018 31/07/2018

I have used the following code but it does not work very well.

ddply(df,~name dob,summarise, actualstart=min(start.date), actualend=max(end.date))

The issue is that it also aggregates the records that do not directly follow.

Please help, thank you.

CodePudding user response:

Here is one option using dplyr.

Take the difference between current Start_date and the previous End_date if the difference is greater than 1 day then merge the dates.

library(dplyr)

df %>%
  mutate(across(-Name, lubridate::dmy)) %>%
  group_by(Name) %>%
  group_by(grp = cumsum(Start_Date - lag(End_Date, default = first(Start_Date)) > 1), .add = TRUE) %>%
  summarise(DOB = first(DOB), 
            Start_Date = min(Start_Date), 
            End_Date = max(End_Date), .groups = 'drop') %>%
  select(-grp)

#   Name    DOB        Start_Date End_Date  
#  <chr>   <date>     <date>     <date>    
#1 JaneDoe 1985-01-01 2018-06-20 2018-07-02
#2 JaneDoe 1985-01-01 2018-07-30 2018-07-31
#3 JohnDoe 2000-01-01 2015-05-22 2015-06-20
#4 JohnDoe 2000-01-01 2015-07-07 2015-07-08

data

It is easier to help if you provide data in a reproducible format

df <- structure(list(Name = c("JohnDoe", "JohnDoe", "JohnDoe", "JohnDoe", 
"JaneDoe", "JaneDoe", "JaneDoe", "JaneDoe"), DOB = c("1/01/2000", 
"1/01/2000", "1/01/2000", "1/01/2000", "1/01/1985", "1/01/1985", 
"1/01/1985", "1/01/1985"), Start_Date = c("22/05/2015", "1/06/2015", 
"16/06/2015", "7/07/2015", "20/06/2018", "22/06/2018", "1/07/2018", 
"30/07/2018"), End_Date = c("31/05/2015", "15/06/2015", "20/06/2015", 
"8/07/2015", "21/06/2018", "30/06/2018", "2/07/2018", "31/07/2018"
)), class = "data.frame", row.names = c(NA, -8L))
  • Related