I'm trying to wrap my head around how to use data.table::foverlaps()
to generate new data tables. In one application, I would like to use foverlaps to identify gaps and then use this information to truncate my original data table.
Suppose that I have a dataset (df1
) of 2 employees (id
) at a company with date ranges (start_date
and end_date
) for the periods in which they work on different projects
(proj_id
; either "A", "B" or "C").
library(data.table)
library(lubridate)
df1<-data.table(id = rep(1:2,each=3),
start_date = ymd(c("1998-04-03","1999-03-08","2000-08-13",
"2005-03-03","2007-10-12","2014-02-23")),
end_date = ymd(c("1999-03-07","2000-08-12","2021-04-23",
"2007-09-05","2014-02-22","2019-05-04")),
proj_id = c("A","B","A","B","C","A"))
> df1
id start_date end_date proj_id
1: 1 1998-04-03 1999-03-07 A
2: 1 1999-03-08 2000-08-12 B
3: 1 2000-08-13 2021-04-23 A
4: 2 2005-03-03 2007-09-05 B
5: 2 2007-10-12 2014-02-22 C
6: 2 2014-02-23 2019-05-04 A
Now I have another dataset (df2
) that specifies the time that I want to truncate from df1
.
df2 <- data.table(id = 1:2,
start_date = ymd("1998-07-20", "2006-06-12"),
end_date = ymd("1998-08-15", "2016-04-08"))
> df2
id start_date end_date
1: 1 1998-07-20 1998-08-15
2: 2 2006-06-12 2016-04-08
I can then use data.table::foverlaps()
to identify the overlapping episodes:
> setkey(df1,id,start_date,end_date)
> foverlaps(df2, df1, type="any",
by.x=c("id","start_date","end_date"))
id start_date end_date proj_id i.start_date i.end_date
1: 1 1998-04-03 1999-03-07 A 1998-07-20 1998-08-15
2: 2 2005-03-03 2007-09-05 B 2006-06-12 2016-04-08
3: 2 2007-10-12 2014-02-22 C 2006-06-12 2016-04-08
4: 2 2014-02-23 2019-05-04 A 2006-06-12 2016-04-08
I would now like to use this data to generate a new version of df1
, where I generate new episodes by truncating the gaps identified above. My desired DT is therefore:
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A
```
CodePudding user response:
There may be alternatives that work better, but this could work based on your foverlaps
result.
Assume you created another data.table called df3
with your foverlaps
result:
df3 <- foverlaps(df2, df1, type = "any", by.x = c("id", "start_date", "end_date"))
Then you could iterate through each row, and add 0, 1, or 2 date ranges depending on overlap (truncate at end, or beginning, or entire range is blocked out).
dt <- data.table(start_date = Date(), end_date = Date(), id = numeric(), proj_id = numeric())
for (i in seq_len(nrow(df3))) {
if (df3$start_date[i] < df3$i.start_date[i]) {
dt <- rbind(dt, data.table(start_date = df3$start_date[i], end_date = df3$i.start_date[i] - 1, id = df3$id[i], proj_id = df3$proj_id[i]))
}
if (df3$end_date[i] > df3$i.end_date[i]) {
dt <- rbind(dt, data.table(start_date = df3$i.end_date[i] 1, end_date = df3$end_date[i], id = df3$id[i], proj_id = df3$proj_id[i]))
}
}
Finally, you can remove the foverlaps
results from your initial df1
since new ranges have been determine for those (using fsetdiff
). Then, you can add the new ranges back.
rbind(fsetdiff(df1, df3[,1:4]), dt)[order(id, start_date)]
Output
id start_date end_date proj_id
1: 1 1998-04-03 1998-07-19 A
2: 1 1998-08-16 1999-03-07 A
3: 1 1999-03-08 2000-08-12 B
4: 1 2000-08-13 2021-04-23 A
5: 2 2005-03-03 2006-06-11 B
6: 2 2016-04-09 2019-05-04 A