Given 2 datasets with dates, how would I accurate assign a unit in one dataframe to an event in the other?
I tried the fuzzyjoin package, but have not been successful. The challenge is that ancillary events can overlap 2 anchor events and so they must be assigned based upon the start and end date.
library(tidyverse)
library(lubridate)
library(fuzzyjoin)
anchor_df <- tribble(
~person, ~anchor_beg, ~anchor_end,
'a' , '01-01-2020', '01-05-2020' ,
'a' , '01-17-2020', '01-18-2020' ,
'a' , '02-11-2020', '02-22-2020' ,
'b' , '04-01-2020', '04-07-2020'
)
ancillary_df <- tribble(
~person, ~anc_start , ~anc_end, ~units,
'a' , '01-07-2020', '01-11-2020' ,3,
'a' , '02-24-2020', '03-22-2020' , 15,
'b' , '04-08-2020', '06-07-2020', 25
)
anchor_df$anchor_beg <- mdy(anchor_df$anchor_beg)
anchor_df$anchor_end <- mdy(anchor_df$anchor_end)
ancillary_df$anc_start <- mdy(ancillary_df$anc_start)
ancillary_df$anc_end <- mdy(ancillary_df$anc_end)
fuzzy_left_join(
ancillary_df, anchor_df,
by = c(
"person" = "person",
"anc_start" = "anchor_end",
"anc_start" = "anchor_beg"
),
match_fun = list(`==`, `>=`, `<=`)
)
My desired output is:
I appreciate any pointers
CodePudding user response:
We can do this with data.table
, which is able to perform "range joins."
We'll need to convert your date fields to dates (or at least, non-character vectors) and then turn each data.frame
into a data.table
.
library(tidyr)
library(dplyr)
library(data.table)
anchor_df <- tribble(
~person, ~anchor_beg, ~anchor_end,
'a' , '01-01-2020', '01-05-2020' ,
'a' , '01-17-2020', '01-18-2020' ,
'a' , '02-11-2020', '02-22-2020' ,
'b' , '04-01-2020', '04-07-2020'
)
ancillary_df <- tribble(
~person, ~anc_start , ~anc_end, ~units,
'a' , '01-07-2020', '01-11-2020' ,3,
'a' , '02-24-2020', '03-22-2020' , 15,
'b' , '04-08-2020', '06-07-2020', 25
)
# Dates need to be date class
anchor_df <- anchor_df %>%
mutate(
anchor_beg = as.Date(anchor_beg, "%m-%d-%Y"),
anchor_end = as.Date(anchor_end, "%m-%d-%Y")
)
ancillary_df <- ancillary_df %>%
mutate(
anc_start = as.Date(anc_start, "%m-%d-%Y"),
anc_end = as.Date(anc_end, "%m-%d-%Y")
)
# Convert to data.table
anchor_dt <- as.data.table(anchor_df)
ancillary_dt <- as.data.table(ancillary_df)
We can then check whether an anchor starts after the ancillary start and ends before the ancillary ends. To ensure we match to the correct persons, we also match on person.
# Range join with data.table
anchor_dt[ancillary_dt, on = .(anchor_beg >= anc_start,
anchor_end <= anc_end,
person == person)]
#> person anchor_beg anchor_end units
#> 1: a 2020-01-07 2020-01-11 3
#> 2: a 2020-02-24 2020-03-22 15
#> 3: b 2020-04-08 2020-06-07 25
CodePudding user response:
Certainly not an elegant solution and I am sure there are better alternatives, but this seems to solve what I am trying to do.
anchor_df %>%
full_join(ancillary_df, by = "person") %>%
filter(anc_start > anchor_end) %>%
group_by(person, anc_start) %>%
mutate(rn = row_number()) %>%
slice(which.max(rn)) %>%
select(-rn) %>%
full_join(anchor_df, by= c("person" = "person", "anchor_beg"="anchor_beg")) %>%
select(person, anchor_beg, anchor_end.y, anc_start, anc_end, units) %>%
rename(anchor_end = anchor_end.y) %>%
mutate(units = ifelse(is.na(units), 0, units)) %>%
arrange(person, anchor_beg)