Home > OS >  How can we conditionally join and summarize 2 dataframes based upon date criteria
How can we conditionally join and summarize 2 dataframes based upon date criteria

Time:07-27

Given 2 datasets with dates, how would I accurate assign a unit in one dataframe to an event in the other?

I tried the fuzzyjoin package, but have not been successful. The challenge is that ancillary events can overlap 2 anchor events and so they must be assigned based upon the start and end date.

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

anchor_df <- tribble(
  ~person, ~anchor_beg, ~anchor_end,
  'a' , '01-01-2020', '01-05-2020' ,
  'a' , '01-17-2020', '01-18-2020' ,
  'a' , '02-11-2020', '02-22-2020' ,
  'b' , '04-01-2020', '04-07-2020' 
)

ancillary_df <- tribble(
  ~person,  ~anc_start , ~anc_end, ~units,
  'a' , '01-07-2020', '01-11-2020' ,3,
  'a' , '02-24-2020', '03-22-2020' , 15,
  'b' , '04-08-2020', '06-07-2020', 25 
)

anchor_df$anchor_beg <- mdy(anchor_df$anchor_beg)
anchor_df$anchor_end <- mdy(anchor_df$anchor_end)
ancillary_df$anc_start <- mdy(ancillary_df$anc_start)
ancillary_df$anc_end <- mdy(ancillary_df$anc_end)


fuzzy_left_join(
  ancillary_df, anchor_df,
  by = c(
    "person" = "person",
    "anc_start" = "anchor_end",
    "anc_start" = "anchor_beg"
  ),
  match_fun = list(`==`, `>=`, `<=`)
)

My desired output is:

enter image description here

I appreciate any pointers

CodePudding user response:

We can do this with data.table, which is able to perform "range joins."

We'll need to convert your date fields to dates (or at least, non-character vectors) and then turn each data.frame into a data.table.

library(tidyr)
library(dplyr)
library(data.table)

anchor_df <- tribble(
  ~person, ~anchor_beg, ~anchor_end,
  'a' , '01-01-2020', '01-05-2020' ,
  'a' , '01-17-2020', '01-18-2020' ,
  'a' , '02-11-2020', '02-22-2020' ,
  'b' , '04-01-2020', '04-07-2020' 
)

ancillary_df <- tribble(
  ~person,  ~anc_start , ~anc_end, ~units,
  'a' , '01-07-2020', '01-11-2020' ,3,
  'a' , '02-24-2020', '03-22-2020' , 15,
  'b' , '04-08-2020', '06-07-2020', 25 
)

# Dates need to be date class
anchor_df <- anchor_df %>% 
  mutate(
    anchor_beg = as.Date(anchor_beg, "%m-%d-%Y"),
    anchor_end = as.Date(anchor_end, "%m-%d-%Y")
  )

ancillary_df <- ancillary_df %>% 
  mutate(
    anc_start = as.Date(anc_start, "%m-%d-%Y"),
    anc_end = as.Date(anc_end, "%m-%d-%Y")
  )

# Convert to data.table
anchor_dt <- as.data.table(anchor_df)
ancillary_dt <- as.data.table(ancillary_df)

We can then check whether an anchor starts after the ancillary start and ends before the ancillary ends. To ensure we match to the correct persons, we also match on person.

# Range join with data.table
anchor_dt[ancillary_dt, on = .(anchor_beg >= anc_start,
                               anchor_end <= anc_end,
                               person == person)]
#>    person anchor_beg anchor_end units
#> 1:      a 2020-01-07 2020-01-11     3
#> 2:      a 2020-02-24 2020-03-22    15
#> 3:      b 2020-04-08 2020-06-07    25

CodePudding user response:

Certainly not an elegant solution and I am sure there are better alternatives, but this seems to solve what I am trying to do.

anchor_df %>% 
    full_join(ancillary_df, by = "person") %>% 
    filter(anc_start > anchor_end) %>% 
    group_by(person, anc_start) %>% 
    mutate(rn = row_number()) %>% 
    slice(which.max(rn)) %>% 
    select(-rn) %>% 
    full_join(anchor_df, by= c("person" = "person", "anchor_beg"="anchor_beg")) %>% 
    select(person, anchor_beg, anchor_end.y, anc_start, anc_end, units) %>% 
    rename(anchor_end = anchor_end.y) %>% 
    mutate(units = ifelse(is.na(units), 0, units)) %>% 
    arrange(person, anchor_beg)
  •  Tags:  
  • r
  • Related