Home > OS >  R data.table rolling join creating extra rows
R data.table rolling join creating extra rows

Time:12-29

Working with data.table's rolling join functionality, I noticed some extra rows being created in my data. I tried replicating this issue on a smaller scale:

A <- data.table(id = c(1, 2, 2, 3), 
            dod = as.Date(c('2022-08-01', '2022-01-01', '2022-01-01', '2022-03-01')), 
            sex = c('M', 'F', 'M', 'F'))

B <- data.table(id = c(1, 2, 2, 3, 4, 5), 
            pay_date = as.Date(c('2022-12-01', '2022-01-01', '2022-01-01', '2022-07-01', '2022-08-01', '2022-10-01')), 
            prem = c(100, 150, 120, 80, 160, 180))
A[, roll_date := dod]
B[, roll_date := pay_date]

Running A[B, on = .(id, roll_date), roll = T] I was expecting the output to be a data.table of the same length as B, i.e. 6. However, the output is as follows:

   id        dod  sex  roll_date   pay_date prem
1:  1 2022-08-01    M 2022-12-01 2022-12-01  100
2:  2 2022-01-01    F 2022-01-01 2022-01-01  150
3:  2 2022-01-01    M 2022-01-01 2022-01-01  150
4:  2 2022-01-01    F 2022-01-01 2022-01-01  120
5:  2 2022-01-01    M 2022-01-01 2022-01-01  120
6:  3 2022-03-01    F 2022-07-01 2022-07-01   80
7:  4       <NA> <NA> 2022-08-01 2022-08-01  160
8:  5       <NA> <NA> 2022-10-01 2022-10-01  180

In particular, R outputs all combinations of the key value id == 2 & roll_date == '2022-01-01'. Notice that the rows in the output are all distinct.

Interestingly, as soon as there is no exact match, say we change pay_date for id == 2 in B

B <- data.table(id = c(1, 2, 2, 3, 4, 5), 
            pay_date = as.Date(c('2022-12-01', '2022-02-01', '2022-02-01', '2022-07-01', '2022-08-01', '2022-10-01')), 
            prem = c(100, 150, 120, 80, 160, 180)),

the problem disappears.

What is the reason for this and could it happen that I get a similar problem when using the assignment by reference operator := within my rolling join?

Thanks in advance for any help!

EDIT

The expected output is this:

   id   pay_date prem  roll_date id_b      dod_b sex_b roll_date_b
1:  1 2022-12-01  100 2022-12-01   NA       <NA>  <NA>        <NA>
2:  2 2022-01-01  150 2022-01-01    2 2022-01-01     F  2022-01-01
3:  2 2022-01-01  120 2022-01-01    2 2022-01-01     F  2022-01-01
4:  3 2022-07-01   80 2022-07-01    3 2022-09-01     M  2022-09-01
5:  4 2022-08-01  160 2022-08-01   NA       <NA>  <NA>        <NA>
6:  5 2022-10-01  180 2022-10-01   NA       <NA>  <NA>        <NA>

CodePudding user response:

This is a way, but you loss all non first rows when keys are non unique

library(data.table)

A <- data.table(id = c(1, 2, 2, 3), 
            dod = as.Date(c('2022-08-01', '2022-01-01', '2022-01-01', '2022-03-01')), 
            sex = c('M', 'F', 'M', 'F'))

B <- data.table(id = c(1, 2, 2, 3, 4, 5), 
            pay_date = as.Date(c('2022-12-01', '2022-01-01', '2022-01-01', '2022-07-01', '2022-08-01', '2022-10-01')), 
            prem = c(100, 150, 120, 80, 160, 180))


A[, roll_date := dod]
B[, roll_date := pay_date]


A[!duplicated(id,roll_date)][B, on = .(id, roll_date), roll = T]

#>    id        dod  sex  roll_date   pay_date prem
#> 1:  1 2022-08-01    M 2022-12-01 2022-12-01  100
#> 2:  2 2022-01-01    F 2022-01-01 2022-01-01  150
#> 3:  2 2022-01-01    F 2022-01-01 2022-01-01  120
#> 4:  3 2022-03-01    F 2022-07-01 2022-07-01   80
#> 5:  4       <NA> <NA> 2022-08-01 2022-08-01  160
#> 6:  5       <NA> <NA> 2022-10-01 2022-10-01  180
  • Related