How to combine two data.tables based on multiple criteria in R?

I have two data.tables that I want to combine based on whether a date in one table falls within a time range given in the other. dt1 holds exit dates, and for each ID I want to look up in dt2 which values were valid on the exit date.

dt1 <- data.table (ID = 1:10,
                  exit = c("31/12/2010", "01/01/2021", "30/09/2010", "31/12/2015", "30/09/2010","31/10/2018", "01/02/2016", "01/05/2015", "01/09/2013", "01/01/2016"))

dt2 <- data.table (ID = c(1,2,2,2,3,5,6,6,7,8,8,9,10),
                   valid_from = c("01/01/2010", "01/01/2012", "01/01/2013", "01/12/2017", "01/05/2010", "01/04/2010", "01/05/2014", "01/11/2016", "01/01/2016", "15/04/2013", "01/01/2015", "15/02/2010", "01/04/2012"),
                   valid_until = c("01/01/2021", "31/12/2012", "30/11/2017", "01/01/2021", "01/01/2021", "01/01/2021", "31/10/2016", "01/01/2021", "01/01/2021", "31/12/2014", "01/05/2015", "01/01/2013", "01/01/2021"),
                   text1 = c("a", "a", "b", "c", "b", "b", "c", "a", "a", "b", "a", "c", "a"),
                   text2 = c("I", "I", "II", "I", "III", "I", "II", "III", "I", "II", "II", "I", "III" ))

    ID       exit
 1:  1 31/12/2010
 2:  2 01/01/2021
 3:  3 30/09/2010
 4:  4 31/12/2015
 5:  5 30/09/2010
 6:  6 31/10/2018
 7:  7 01/02/2016
 8:  8 01/05/2015
 9:  9 01/09/2013
10: 10 01/01/2016

    ID valid_from valid_until text1 text2
 1:  1 01/01/2010  01/01/2021     a     I
 2:  2 01/01/2012  31/12/2012     a     I
 3:  2 01/01/2013  30/11/2017     b    II
 4:  2 01/12/2017  01/01/2021     c     I
 5:  3 01/05/2010  01/01/2021     b   III
 6:  5 01/04/2010  01/01/2021     b     I
 7:  6 01/05/2014  31/10/2016     c    II
 8:  6 01/11/2016  01/01/2021     a   III
 9:  7 01/01/2016  01/01/2021     a     I
10:  8 15/04/2013  31/12/2014     b    II
11:  8 01/01/2015  01/05/2015     a    II
12:  9 15/02/2010  01/01/2013     c     I
13: 10 01/04/2012  01/01/2021     a   III

As a result, I would like dt1 to contain the values that were valid on each exit date. If an ID is not found in dt2 (as is the case for ID 4 in the sample data), the result should be NA.

     ID       exit text1 text2
 1:  1 31/12/2010     a     I
 2:  2 01/01/2021     c     I
 3:  3 30/09/2010     b   III
 4:  4 31/12/2015  <NA>  <NA>
 5:  5 30/09/2010     b     I
 6:  6 31/10/2018     a   III
 7:  7 01/02/2016     a     I
 8:  8 01/05/2015     a    II
 9:  9 01/09/2013     c     I
10: 10 01/01/2016     a   III

Could anyone help me solve this?

CodePudding user response：

Since the input is a data.table, consider using data.table methods, which are fast:

library(data.table)
# convert the date columns to `IDate` class
dt1[, exit := as.IDate(exit, '%d/%m/%Y')]
dt2[, c('valid_from', 'valid_until') := .(as.IDate(valid_from, '%d/%m/%Y'),
       as.IDate(valid_until, '%d/%m/%Y'))]
# non-equi join: update dt1 by reference with the matching rows of dt2
dt1[dt2, c('text1', 'text2') := .(i.text1, i.text2),
    on = .(ID, exit >= valid_from, exit <= valid_until)]

output:

> dt1
    ID       exit text1 text2
 1:  1 2010-12-31     a     I
 2:  2 2021-01-01     c     I
 3:  3 2010-09-30     b   III
 4:  4 2015-12-31  <NA>  <NA>
 5:  5 2010-09-30     b     I
 6:  6 2018-10-31     a   III
 7:  7 2016-02-01     a     I
 8:  8 2015-05-01     a    II
 9:  9 2013-09-01  <NA>  <NA>
10: 10 2016-01-01     a   III
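
If dt1 should stay unmodified, the same non-equi join can probably also be written so that it returns a new table instead of updating by reference. The following is only a minimal sketch on a three-ID subset of the sample data; the names `exits`, `ranges` and `res` are made up for illustration and do not appear in the question.

```r
library(data.table)

# Subset of the sample data (IDs 2, 4 and 9), dates parsed as IDate
exits <- data.table(
  ID   = c(2L, 4L, 9L),
  exit = as.IDate(c("01/01/2021", "31/12/2015", "01/09/2013"), "%d/%m/%Y")
)
ranges <- data.table(
  ID          = c(2L, 2L, 2L, 9L),
  valid_from  = as.IDate(c("01/01/2012", "01/01/2013", "01/12/2017", "15/02/2010"), "%d/%m/%Y"),
  valid_until = as.IDate(c("31/12/2012", "30/11/2017", "01/01/2021", "01/01/2013"), "%d/%m/%Y"),
  text1       = c("a", "b", "c", "c"),
  text2       = c("I", "II", "I", "I")
)

# ranges[exits, ...] keeps every row of `exits`; rows without a matching
# validity range get NA in the columns taken from `ranges`.
# `i.exit` refers to the exit column of `exits` (the table in i).
res <- ranges[exits,
              .(ID, exit = i.exit, text1, text2),
              on = .(ID, valid_from <= exit, valid_until >= exit)]
print(res)
```

In this sketch ID 4 has no entry in `ranges` at all, and ID 9 has an exit date (2013-09-01) that lies after its only valid_until (2013-01-01), so both come back as NA. That matches the NA shown for ID 9 in the output above.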

CodePudding user response：

Here is a dplyr solution that was created with the help of @akrun (see the related question "dates: Not yet implemented NAbounds=TRUE for this non-numeric and non-character type"):

library(dplyr)
library(lubridate)

df1 <- left_join(dt1, dt2, by="ID") %>% 
  mutate(across(c(exit, valid_from, valid_until), dmy)) %>% 
  rowwise() %>% 
  mutate(match=  (dplyr::between(exit, valid_from, valid_until))) %>% 
  group_by(ID) %>% 
  filter(match==max(match) | is.na(match)) %>% 
  select(ID, exit, text1, text2) %>% 
  ungroup()

output:

      ID exit       text1 text2
   <dbl> <date>     <chr> <chr>
 1     1 2010-12-31 a     I    
 2     2 2021-01-01 c     I    
 3     3 2010-09-30 b     III  
 4     4 2015-12-31 NA    NA   
 5     5 2010-09-30 b     I    
 6     6 2018-10-31 a     III  
 7     7 2016-02-01 a     I    
 8     8 2015-05-01 a     II   
 9     9 2013-09-01 c     I    
10    10 2016-01-01 a     III
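
Note that the `filter()` step above keeps the rows of an ID when none of its ranges match, which is why ID 9 is returned with c / I even though its exit date (2013-09-01) lies after valid_until (2013-01-01). If NA is wanted in that case, one possible alternative is a range join with `join_by()`. This is a hedged sketch that assumes dplyr 1.1.0 or later; the names `exits`, `ranges` and `res` are illustrative only, and the data is a three-ID subset of the sample.

```r
library(dplyr)

# Subset of the sample data (IDs 2, 4 and 9), dates parsed as Date
exits <- tibble(
  ID   = c(2, 4, 9),
  exit = as.Date(c("01/01/2021", "31/12/2015", "01/09/2013"), format = "%d/%m/%Y")
)
ranges <- tibble(
  ID          = c(2, 2, 2, 9),
  valid_from  = as.Date(c("01/01/2012", "01/01/2013", "01/12/2017", "15/02/2010"), format = "%d/%m/%Y"),
  valid_until = as.Date(c("31/12/2012", "30/11/2017", "01/01/2021", "01/01/2013"), format = "%d/%m/%Y"),
  text1       = c("a", "b", "c", "c"),
  text2       = c("I", "II", "I", "I")
)

# Match on ID and require valid_from <= exit <= valid_until (bounds inclusive
# by default); exits without a matching range are kept with NA
res <- exits %>%
  left_join(ranges, by = join_by(ID, between(exit, valid_from, valid_until))) %>%
  select(ID, exit, text1, text2)
print(res)
```

With this variant an ID is only filled in when its exit date actually falls inside one of its validity ranges, so no `rowwise()` or grouped `filter()` is needed.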

CodePudding user response：

You may use fuzzyjoin after changing the dates to Date class.

library(fuzzyjoin)
library(dplyr)

dt1 %>%
  mutate(exit = as.Date(exit, '%d/%m/%Y')) %>%
  fuzzy_left_join(dt2 %>%
                    mutate(across(starts_with('valid'), ~ as.Date(.x, '%d/%m/%Y'))),
                  by = c('ID', 'exit' = 'valid_from', 'exit' = 'valid_until'),
                  match_fun = c(`==`, `>=`, `<=`)) %>%
  select(ID = ID.x, exit, text1, text2)

#   ID       exit text1 text2
#1   1 2010-12-31     a     I
#2   2 2021-01-01     c     I
#3   3 2010-09-30     b   III
#4   4 2015-12-31  <NA>  <NA>
#5   5 2010-09-30     b     I
#6   6 2018-10-31     a   III
#7   7 2016-02-01     a     I
#8   8 2015-05-01     a    II
#9   9 2013-09-01  <NA>  <NA>
#10 10 2016-01-01     a   III