How can we conditionally join and summarize 2 dataframes based upon date criteria-CodePudding

Given 2 datasets with dates, how would I accurate assign a unit in one dataframe to an event in the other?

I tried the fuzzyjoin package, but have not been successful. The challenge is that ancillary events can overlap 2 anchor events and so they must be assigned based upon the start and end date.

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

anchor_df <- tribble(
  ~person, ~anchor_beg, ~anchor_end,
  'a' , '01-01-2020', '01-05-2020' ,
  'a' , '01-17-2020', '01-18-2020' ,
  'a' , '02-11-2020', '02-22-2020' ,
  'b' , '04-01-2020', '04-07-2020' 
)

ancillary_df <- tribble(
  ~person,  ~anc_start , ~anc_end, ~units,
  'a' , '01-07-2020', '01-11-2020' ,3,
  'a' , '02-24-2020', '03-22-2020' , 15,
  'b' , '04-08-2020', '06-07-2020', 25 
)

anchor_df$anchor_beg <- mdy(anchor_df$anchor_beg)
anchor_df$anchor_end <- mdy(anchor_df$anchor_end)
ancillary_df$anc_start <- mdy(ancillary_df$anc_start)
ancillary_df$anc_end <- mdy(ancillary_df$anc_end)


fuzzy_left_join(
  ancillary_df, anchor_df,
  by = c(
    "person" = "person",
    "anc_start" = "anchor_end",
    "anc_start" = "anchor_beg"
  ),
  match_fun = list(`==`, `>=`, `<=`)
)

My desired output is:

I appreciate any pointers

CodePudding user response：

We can do this with data.table, which is able to perform "range joins."

We'll need to convert your date fields to dates (or at least, non-character vectors) and then turn each data.frame into a data.table.

library(tidyr)
library(dplyr)
library(data.table)

anchor_df <- tribble(
  ~person, ~anchor_beg, ~anchor_end,
  'a' , '01-01-2020', '01-05-2020' ,
  'a' , '01-17-2020', '01-18-2020' ,
  'a' , '02-11-2020', '02-22-2020' ,
  'b' , '04-01-2020', '04-07-2020' 
)

ancillary_df <- tribble(
  ~person,  ~anc_start , ~anc_end, ~units,
  'a' , '01-07-2020', '01-11-2020' ,3,
  'a' , '02-24-2020', '03-22-2020' , 15,
  'b' , '04-08-2020', '06-07-2020', 25 
)

# Dates need to be date class
anchor_df <- anchor_df %>% 
  mutate(
    anchor_beg = as.Date(anchor_beg, "%m-%d-%Y"),
    anchor_end = as.Date(anchor_end, "%m-%d-%Y")
  )

ancillary_df <- ancillary_df %>% 
  mutate(
    anc_start = as.Date(anc_start, "%m-%d-%Y"),
    anc_end = as.Date(anc_end, "%m-%d-%Y")
  )

# Convert to data.table
anchor_dt <- as.data.table(anchor_df)
ancillary_dt <- as.data.table(ancillary_df)

We can then check whether an anchor starts after the ancillary start and ends before the ancillary ends. To ensure we match to the correct persons, we also match on person.

# Range join with data.table
anchor_dt[ancillary_dt, on = .(anchor_beg >= anc_start,
                               anchor_end <= anc_end,
                               person == person)]
#>    person anchor_beg anchor_end units
#> 1:      a 2020-01-07 2020-01-11     3
#> 2:      a 2020-02-24 2020-03-22    15
#> 3:      b 2020-04-08 2020-06-07    25

CodePudding user response：

Certainly not an elegant solution and I am sure there are better alternatives, but this seems to solve what I am trying to do.

anchor_df %>% 
    full_join(ancillary_df, by = "person") %>% 
    filter(anc_start > anchor_end) %>% 
    group_by(person, anc_start) %>% 
    mutate(rn = row_number()) %>% 
    slice(which.max(rn)) %>% 
    select(-rn) %>% 
    full_join(anchor_df, by= c("person" = "person", "anchor_beg"="anchor_beg")) %>% 
    select(person, anchor_beg, anchor_end.y, anc_start, anc_end, units) %>% 
    rename(anchor_end = anchor_end.y) %>% 
    mutate(units = ifelse(is.na(units), 0, units)) %>% 
    arrange(person, anchor_beg)