Home > Software engineering >  Joining dates are changed to certain key in data.table
Joining dates are changed to certain key in data.table

Time:12-12

I have the following dataframes df and df_dates (dput below):

> df
   group      start        end
1      A 2022-12-01 2022-12-04
2      A 2022-12-04 2022-12-07
3      A 2022-12-07 2022-12-10
4      A 2022-12-10 2022-12-13
5      A 2022-12-13 2022-12-16
6      A 2022-12-16 2022-12-19
7      B 2022-12-01 2022-12-04
8      B 2022-12-04 2022-12-07
9      B 2022-12-07 2022-12-10
10     B 2022-12-10 2022-12-13
11     B 2022-12-13 2022-12-16
12     B 2022-12-16 2022-12-19
> df_dates
  group       date value
1     A 2022-12-02     1
2     A 2022-12-14     3
3     B 2022-12-06     2
4     B 2022-12-13     4

I would like to join the rows of df_dates by group where the date column is between the column start and end of df. When I join these two dataframes, the dates of the date column return the same as the dates from the start column of df. Here is the code with output:

df <- data.frame(group = rep(c('A', 'B'), each = 6),
                 start = c(seq.Date(as.Date('2022-12-01'), as.Date('2022-12-16'), '3 days')),
                 end = c(seq.Date(as.Date('2022-12-04'), as.Date('2022-12-19'), '3 days')))
df_dates <- data.frame(group = c('A', 'A', 'B', 'B'),
                       date = as.Date(c('2022-12-02', '2022-12-14', '2022-12-06', '2022-12-13')),
                       value = c(1,3,2,4))
library(data.table)
setDT(df)
setDT(df_dates)
df_dates[df, 
         .(group, date, start, end, value), 
         on = .(group, date >= start, date <= end)]
#>     group       date      start        end value
#>  1:     A 2022-12-01 2022-12-01 2022-12-04     1
#>  2:     A 2022-12-04 2022-12-04 2022-12-07    NA
#>  3:     A 2022-12-07 2022-12-07 2022-12-10    NA
#>  4:     A 2022-12-10 2022-12-10 2022-12-13    NA
#>  5:     A 2022-12-13 2022-12-13 2022-12-16     3
#>  6:     A 2022-12-16 2022-12-16 2022-12-19    NA
#>  7:     B 2022-12-01 2022-12-01 2022-12-04    NA
#>  8:     B 2022-12-04 2022-12-04 2022-12-07     2
#>  9:     B 2022-12-07 2022-12-07 2022-12-10    NA
#> 10:     B 2022-12-10 2022-12-10 2022-12-13     4
#> 11:     B 2022-12-13 2022-12-13 2022-12-16     4
#> 12:     B 2022-12-16 2022-12-16 2022-12-19    NA

Created on 2022-12-12 with reprex v2.0.2

As you can see, the dates of the date column are now changed to the dates of the start column, while I want them to be the same as in df_dates dataframe. The desired output should look like this:

#>     group       date      start        end value
#>  1:     A 2022-12-02 2022-12-01 2022-12-04     1
#>  2:     A         NA 2022-12-04 2022-12-07    NA
#>  3:     A         NA 2022-12-07 2022-12-10    NA
#>  4:     A         NA 2022-12-10 2022-12-13    NA
#>  5:     A 2022-12-14 2022-12-13 2022-12-16     3
#>  6:     A         NA 2022-12-16 2022-12-19    NA
#>  7:     B         NA 2022-12-01 2022-12-04    NA
#>  8:     B 2022-12-06 2022-12-04 2022-12-07     2
#>  9:     B         NA 2022-12-07 2022-12-10    NA
#> 10:     B 2022-12-13 2022-12-10 2022-12-13     4
#> 11:     B 2022-12-13 2022-12-13 2022-12-16     4
#> 12:     B         NA 2022-12-16 2022-12-19    NA

So I was wondering if anyone knows how to join these two dataframes in the right way using data.table?


dput of df and df_dates:

df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B"), start = structure(c(19327, 19330, 19333, 
19336, 19339, 19342, 19327, 19330, 19333, 19336, 19339, 19342
), class = "Date"), end = structure(c(19330, 19333, 19336, 19339, 
19342, 19345, 19330, 19333, 19336, 19339, 19342, 19345), class = "Date")), class = "data.frame", row.names = c(NA, 
-12L))

df_dates <- structure(list(group = c("A", "A", "B", "B"), date = structure(c(19328, 
19340, 19332, 19339), class = "Date"), value = c(1, 3, 2, 4)), class = "data.frame", row.names = c(NA, 
-4L))

CodePudding user response:

library(data.table)
setDT(df)
setDT(df_dates)

Just specifying you want to the original date (you can use prefix x. or i. (to reference df in this case) )

df_dates[df, 
         .(group, x.date, start, end, value), 
         on = .(group, date >= start, date <= end)]

Or modifying the original:

df[, c("date", "value") := 
       df_dates[.SD, on = .(group, date >= start, date <= end), .(x.date, value)]]
#      group      start        end       date value
#     <char>     <Date>     <Date>     <Date> <num>
#  1:      A 2022-12-01 2022-12-04 2022-12-02     1
#  2:      A 2022-12-04 2022-12-07       <NA>    NA
#  3:      A 2022-12-07 2022-12-10       <NA>    NA
#  4:      A 2022-12-10 2022-12-13       <NA>    NA
#  5:      A 2022-12-13 2022-12-16 2022-12-14     3
#  6:      A 2022-12-16 2022-12-19       <NA>    NA
#  7:      B 2022-12-01 2022-12-04       <NA>    NA
#  8:      B 2022-12-04 2022-12-07 2022-12-06     2
#  9:      B 2022-12-07 2022-12-10       <NA>    NA
# 10:      B 2022-12-10 2022-12-13 2022-12-13     4
# 11:      B 2022-12-13 2022-12-16 2022-12-13     4
# 12:      B 2022-12-16 2022-12-19       <NA>    NA
  • Related