Working with data.table's rolling join functionality, I noticed some extra rows being created in my data. I tried replicating this issue on a smaller scale:
A <- data.table(id = c(1, 2, 2, 3),
dod = as.Date(c('2022-08-01', '2022-01-01', '2022-01-01', '2022-03-01')),
sex = c('M', 'F', 'M', 'F'))
B <- data.table(id = c(1, 2, 2, 3, 4, 5),
pay_date = as.Date(c('2022-12-01', '2022-01-01', '2022-01-01', '2022-07-01', '2022-08-01', '2022-10-01')),
prem = c(100, 150, 120, 80, 160, 180))
A[, roll_date := dod]
B[, roll_date := pay_date]
Running A[B, on = .(id, roll_date), roll = T]
I was expecting the output to be a data.table of the same length as B, i.e. 6. However, the output is as follows:
id dod sex roll_date pay_date prem
1: 1 2022-08-01 M 2022-12-01 2022-12-01 100
2: 2 2022-01-01 F 2022-01-01 2022-01-01 150
3: 2 2022-01-01 M 2022-01-01 2022-01-01 150
4: 2 2022-01-01 F 2022-01-01 2022-01-01 120
5: 2 2022-01-01 M 2022-01-01 2022-01-01 120
6: 3 2022-03-01 F 2022-07-01 2022-07-01 80
7: 4 <NA> <NA> 2022-08-01 2022-08-01 160
8: 5 <NA> <NA> 2022-10-01 2022-10-01 180
In particular, R outputs all combinations of the key value id == 2 & roll_date == '2022-01-01'
. Notice that the rows in the output are all distinct.
Interestingly, as soon as there is no exact match, say we change pay_date
for id == 2
in B
B <- data.table(id = c(1, 2, 2, 3, 4, 5),
pay_date = as.Date(c('2022-12-01', '2022-02-01', '2022-02-01', '2022-07-01', '2022-08-01', '2022-10-01')),
prem = c(100, 150, 120, 80, 160, 180)),
the problem disappears.
What is the reason for this and could it happen that I get a similar problem when using the assignment by reference operator :=
within my rolling join?
Thanks in advance for any help!
EDIT
The expected output is this:
id pay_date prem roll_date id_b dod_b sex_b roll_date_b
1: 1 2022-12-01 100 2022-12-01 NA <NA> <NA> <NA>
2: 2 2022-01-01 150 2022-01-01 2 2022-01-01 F 2022-01-01
3: 2 2022-01-01 120 2022-01-01 2 2022-01-01 F 2022-01-01
4: 3 2022-07-01 80 2022-07-01 3 2022-09-01 M 2022-09-01
5: 4 2022-08-01 160 2022-08-01 NA <NA> <NA> <NA>
6: 5 2022-10-01 180 2022-10-01 NA <NA> <NA> <NA>
CodePudding user response:
This is a way, but you loss all non first rows when keys are non unique
library(data.table)
A <- data.table(id = c(1, 2, 2, 3),
dod = as.Date(c('2022-08-01', '2022-01-01', '2022-01-01', '2022-03-01')),
sex = c('M', 'F', 'M', 'F'))
B <- data.table(id = c(1, 2, 2, 3, 4, 5),
pay_date = as.Date(c('2022-12-01', '2022-01-01', '2022-01-01', '2022-07-01', '2022-08-01', '2022-10-01')),
prem = c(100, 150, 120, 80, 160, 180))
A[, roll_date := dod]
B[, roll_date := pay_date]
A[!duplicated(id,roll_date)][B, on = .(id, roll_date), roll = T]
#> id dod sex roll_date pay_date prem
#> 1: 1 2022-08-01 M 2022-12-01 2022-12-01 100
#> 2: 2 2022-01-01 F 2022-01-01 2022-01-01 150
#> 3: 2 2022-01-01 F 2022-01-01 2022-01-01 120
#> 4: 3 2022-03-01 F 2022-07-01 2022-07-01 80
#> 5: 4 <NA> <NA> 2022-08-01 2022-08-01 160
#> 6: 5 <NA> <NA> 2022-10-01 2022-10-01 180