I have two data.tables that I want to combine depending on whether a date in one table falls within a time range given in the other. dt1 holds exit dates, and for each ID I want to look up in dt2 which values were valid on the exit date.
library(data.table)
dt1 <- data.table (ID = 1:10,
exit = c("31/12/2010", "01/01/2021", "30/09/2010", "31/12/2015", "30/09/2010","31/10/2018", "01/02/2016", "01/05/2015", "01/09/2013", "01/01/2016"))
dt2 <- data.table (ID = c(1,2,2,2,3,5,6,6,7,8,8,9,10),
valid_from = c("01/01/2010", "01/01/2012", "01/01/2013", "01/12/2017", "01/05/2010", "01/04/2010", "01/05/2014", "01/11/2016", "01/01/2016", "15/04/2013", "01/01/2015", "15/02/2010", "01/04/2012"),
valid_until = c("01/01/2021", "31/12/2012", "30/11/2017", "01/01/2021", "01/01/2021", "01/01/2021", "31/10/2016", "01/01/2021", "01/01/2021", "31/12/2014", "01/05/2015", "01/01/2013", "01/01/2021"),
text1 = c("a", "a", "b", "c", "b", "b", "c", "a", "a", "b", "a", "c", "a"),
text2 = c("I", "I", "II", "I", "III", "I", "II", "III", "I", "II", "II", "I", "III" ))
ID exit
1: 1 31/12/2010
2: 2 01/01/2021
3: 3 30/09/2010
4: 4 31/12/2015
5: 5 30/09/2010
6: 6 31/10/2018
7: 7 01/02/2016
8: 8 01/05/2015
9: 9 01/09/2013
10: 10 01/01/2016
ID valid_from valid_until text1 text2
1: 1 01/01/2010 01/01/2021 a I
2: 2 01/01/2012 31/12/2012 a I
3: 2 01/01/2013 30/11/2017 b II
4: 2 01/12/2017 01/01/2021 c I
5: 3 01/05/2010 01/01/2021 b III
6: 5 01/04/2010 01/01/2021 b I
7: 6 01/05/2014 31/10/2016 c II
8: 6 01/11/2016 01/01/2021 a III
9: 7 01/01/2016 01/01/2021 a I
10: 8 15/04/2013 31/12/2014 b II
11: 8 01/01/2015 01/05/2015 a II
12: 9 15/02/2010 01/01/2013 c I
13: 10 01/04/2012 01/01/2021 a III
As a result, I would like to add to dt1 the values that were valid on each exit date. If an ID is not found in dt2 (as is the case for ID 4 in the sample data), the result should be NA.
ID exit text1 text2
1: 1 31/12/2010 a I
2: 2 01/01/2021 c I
3: 3 30/09/2010 b III
4: 4 31/12/2015 <NA> <NA>
5: 5 30/09/2010 b I
6: 6 31/10/2018 a III
7: 7 01/02/2016 a I
8: 8 01/05/2015 a II
9: 9 01/09/2013 c I
10: 10 01/01/2016 a III
Could anyone help me solve this?
CodePudding user response:
As the input is a data.table, consider using data.table methods, which are fast.
library(data.table)
# // convert the date columns to `Date` class
dt1[, exit := as.IDate(exit, '%d/%m/%Y')]
dt2[, c('valid_from', 'valid_until') := .(as.IDate(valid_from, '%d/%m/%Y'),
as.IDate(valid_until, '%d/%m/%Y'))]
# // do a non-equi join
dt1[dt2, c('text1', 'text2') := .(i.text1, i.text2),
on = .(ID, exit >= valid_from, exit <= valid_until)]
-output
> dt1
ID exit text1 text2
1: 1 2010-12-31 a I
2: 2 2021-01-01 c I
3: 3 2010-09-30 b III
4: 4 2015-12-31 <NA> <NA>
5: 5 2010-09-30 b I
6: 6 2018-10-31 a III
7: 7 2016-02-01 a I
8: 8 2015-05-01 a II
9: 9 2013-09-01 <NA> <NA>
10: 10 2016-01-01 a III
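If updating dt1 by reference is not wanted, the same non-equi join can also be written so that it returns a new table. The following is only a minimal sketch: it uses a reduced copy of the sample data (IDs 1, 2, 4 and 9), and the name `res` is a placeholder chosen for illustration.

```r
library(data.table)

# Reduced sample data (IDs 1, 2, 4 and 9 from the question), dates parsed up front
dt1 <- data.table(
  ID   = c(1, 2, 4, 9),
  exit = as.IDate(c("31/12/2010", "01/01/2021", "31/12/2015", "01/09/2013"),
                  format = "%d/%m/%Y")
)
dt2 <- data.table(
  ID          = c(1, 2, 2, 2, 9),
  valid_from  = as.IDate(c("01/01/2010", "01/01/2012", "01/01/2013",
                           "01/12/2017", "15/02/2010"), format = "%d/%m/%Y"),
  valid_until = as.IDate(c("01/01/2021", "31/12/2012", "30/11/2017",
                           "01/01/2021", "01/01/2013"), format = "%d/%m/%Y"),
  text1       = c("a", "a", "b", "c", "c"),
  text2       = c("I", "I", "II", "I", "I")
)

# For every row of dt1 (i), look up the dt2 row (x) whose range contains exit.
# The i. and x. prefixes make explicit which table each column comes from.
# Rows of dt1 without a match are kept and filled with NA (the default nomatch).
res <- dt2[dt1,
           .(ID = i.ID, exit = i.exit, text1 = x.text1, text2 = x.text2),
           on = .(ID, valid_from <= exit, valid_until >= exit)]
res
```

ID 9 comes back as NA here as well, because its exit date (2013-09-01) lies after the only valid_until for that ID (2013-01-01).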
CodePudding user response:
Here is a dplyr solution that was created with the help of @akrun (see the question "dates: Not yet implemented NAbounds=TRUE for this non-numeric and non-character type"):
library(dplyr)
library(lubridate)
df1 <- left_join(dt1, dt2, by="ID") %>%
mutate(across(c(exit, valid_from, valid_until), dmy)) %>%
rowwise() %>%
mutate(match= (dplyr::between(exit, valid_from, valid_until))) %>%
group_by(ID) %>%
filter(match==max(match) | is.na(match)) %>%
select(ID, exit, text1, text2) %>%
ungroup()
output:
ID exit text1 text2
<dbl> <date> <chr> <chr>
1 1 2010-12-31 a I
2 2 2021-01-01 c I
3 3 2010-09-30 b III
4 4 2015-12-31 NA NA
5 5 2010-09-30 b I
6 6 2018-10-31 a III
7 7 2016-02-01 a I
8 8 2015-05-01 a II
9 9 2013-09-01 c I
10 10 2016-01-01 a III
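With dplyr 1.1.0 or later, the range condition can also be stated directly in the join through join_by() and its between() helper. This is a different technique from the pipeline above, shown only as a sketch on a reduced copy of the sample data (IDs 1, 2, 4 and 9); the name `res` is a placeholder. Note that it behaves differently for ID 9: the pipeline above keeps "c I" because filter(match == max(match)) also retains a group whose only row is FALSE, whereas a strict range join returns NA there.

```r
library(dplyr)

# Reduced sample data (IDs 1, 2, 4 and 9 from the question), dates parsed up front
dt1 <- tibble(
  ID   = c(1, 2, 4, 9),
  exit = as.Date(c("31/12/2010", "01/01/2021", "31/12/2015", "01/09/2013"),
                 format = "%d/%m/%Y")
)
dt2 <- tibble(
  ID          = c(1, 2, 2, 2, 9),
  valid_from  = as.Date(c("01/01/2010", "01/01/2012", "01/01/2013",
                          "01/12/2017", "15/02/2010"), format = "%d/%m/%Y"),
  valid_until = as.Date(c("01/01/2021", "31/12/2012", "30/11/2017",
                          "01/01/2021", "01/01/2013"), format = "%d/%m/%Y"),
  text1       = c("a", "a", "b", "c", "c"),
  text2       = c("I", "I", "II", "I", "I")
)

# Match on equal ID and on exit lying within [valid_from, valid_until];
# left_join keeps every row of dt1 and fills unmatched rows with NA.
res <- left_join(dt1, dt2,
                 by = join_by(ID, between(exit, valid_from, valid_until))) %>%
  select(ID, exit, text1, text2)
res
```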
CodePudding user response:
You may use fuzzyjoin after changing the dates to Date class.
library(fuzzyjoin)
library(dplyr)
dt1 %>%
mutate(exit = as.Date(exit, '%d/%m/%Y')) %>%
fuzzy_left_join(dt2 %>%
mutate(across(starts_with('valid'), ~ as.Date(.x, '%d/%m/%Y'))),
by = c('ID', 'exit' = 'valid_from', 'exit' = 'valid_until'),
match_fun = c(`==`, `>=`, `<=`)) %>%
select(ID = ID.x, exit, text1, text2)
# ID exit text1 text2
#1 1 2010-12-31 a I
#2 2 2021-01-01 c I
#3 3 2010-09-30 b III
#4 4 2015-12-31 <NA> <NA>
#5 5 2010-09-30 b I
#6 6 2018-10-31 a III
#7 7 2016-02-01 a I
#8 8 2015-05-01 a II
#9 9 2013-09-01 <NA> <NA>
#10 10 2016-01-01 a III