Home > Net >  conditional left_join by id and nearest date
conditional left_join by id and nearest date

Time:10-29

Two datasets to be left joined based on conditions of their id & date's apart

A <- data.frame(id = c(1,2,3),
            application_date = as.Date(c("2010-05-08", "2012-08-08", "2013-06-23")))


B <- data.frame(id = c(1,1,2,2,3,3),
            date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11", "2012-05-20", "2013-03-21", "2013-06-05")),
            value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
            value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
            value3 = c(2500, 3000, 1200, 1900, 5500, 4500))

if for the same id, date difference is less or equal to 6 months, then join by nearest date, follows

library(data.table)
setDT(B)[, application_date := date]
B[A, on = .(customer, date = application_date), roll = 'nearest']

if for the same id, date difference is larger than 6 months, then fill all the joined values (value1,value2, value3) with NAs.

So how to construct and combine left_join with ifelse conditions.

CodePudding user response:

Here is a possible solution using the fuzzy_join() function from the fuzzyjoin package.

A <- data.frame(id = c(1,2,3, 3),
                application_date = as.Date(c("2010-05-08", "2012-08-08", "2013-06-23",  "2015-06-23")))
  
B <- data.frame(id = c(1,1,2,2,3,3),
                date = as.Date(c("2009-01-02", "2009-12-24", "2011-11-11", "2012-05-20", "2013-03-21", "2013-06-05")),
                value1 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value2 = c(2500, 3000, 1200, 1900, 5500, 4500),
                value3 = c(2500, 3000, 1200, 1900, 5500, 4500))

library(fuzzyjoin)
library(dplyr)

#define the test
#exact match for groups, interval matching on date
test <- function(id1, id2) {
   if (class(id1) == "numeric") {
      result <- (id1 == id2)
   }
   else if (class(id1) == "Date") {
      result <-( (id1 - 182) < id2 & id2 < (id1  182) )
   }
   #print(result)
}

answer<-fuzzy_join(A, B, by=c("id"="id", "application_date"="date"), mode='left', match_fun=test)

#create a grouping variable
answer$uid <- paste(answer$id.x, answer$application_date)
#find the min date of match

answer %>% group_by(uid) %>% 
   mutate(delta= abs(application_date - date)) %>% 
   arrange(delta, .by_group = TRUE) %>% 
   slice_head(n = 1) 

Be warned I have not tested the above with all possible edge and corner cases such as multiple rows in B matching with a single row in A.

  • Related