Home > database >  R - Fuzzy Inner Join on two fields, matching to a date range
R - Fuzzy Inner Join on two fields, matching to a date range

Time:07-09

I'm fairly new to R, and have been sifting through other questions all morning trying to figure this out, but can't find anything related enough or my knowledge of R is not good enough to understand some of the suggested solutions to my problem.

I have two data frames, table A with a list of non-unique identifiers and a date, and table B with the same identifier field, and a start and end date outlining a 3 month date range. In my real data, I have 1.7m records in table A, and 1.6k records in table A (as well many other fields that i'll use for the final analysis). I am expecting the vast majority of records in table A to be unnecessary.

What I want to achieve is to join the two tables together, joining on the identifier, and then only joining if the date in table A falls inside the date range in table B. I want this as an inner join so I lose the unnecessary data.

Example tables:

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

Expected result (we lose person 3 because the date falls outside the range, person 2 has two records because they had two entries in table b with corresponding dates):

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

I've been basing my solution on this web page https://exploratory.io/note/exploratory/How-to-join-two-data-frames-with-date-ranges-moq8hEQ6, however my example adds in the requirement to also join the identifier.

My solution that logically makes sense to me based on my (limited) ability in R:

Joined <- fuzzy_inner_join(b, a, by = c("numberb"="numbera", "datex"="date1", "datex"="date2"),
                           match_fun = list("=", ">=", "<="))

However I get this error message:

Error in which(m) : argument to 'which' is not logical

Thank you in advance for any help here :)

CodePudding user response:

With backticks for the match_fun and == for the number:

library(tidyverse)
library(fuzzyjoin)

a <- data.frame(
  numbera = c("1", "2", "3", "1"),
  date1 = as.Date(c("10/04/2021", "21/06/2021", "02/10/2021", "17/02/2022"),
    format = "%d/%m/%Y"
  ),
  date2 = as.Date(c("10/07/2021", "21/09/2021", "02/01/2022", "17/05/2022"),
    format = "%d/%m/%Y"
  )
)

b <- data.frame(
  numberb = c("1", "2", "2", "3", "1"),
  datex = as.Date(c("16/05/2021", "01/08/2021", "03/08/2021", "02/09/2021", "17/03/2022"),
    format = "%d/%m/%Y"
  )
)

fuzzy_inner_join(b, a,
  by = c("numberb" = "numbera", "datex" = "date1", "datex" = "date2"),
  match_fun = list(`==`, `>=`, `<=`)
)
#>   numberb      datex numbera      date1      date2
#> 1       1 2021-05-16       1 2021-04-10 2021-07-10
#> 2       2 2021-08-01       2 2021-06-21 2021-09-21
#> 3       2 2021-08-03       2 2021-06-21 2021-09-21
#> 4       1 2022-03-17       1 2022-02-17 2022-05-17

Created on 2022-07-08 by the reprex package (v2.0.1)

CodePudding user response:

Alternate solution using data.table:

library(data.table)
dt <- merge(a,b, by.x = "numbera", by.y = "numberb")
setDT(dt)
dt[date1 <= datex & date2 >= datex]

   numbera      date1      date2      datex
1:       1 2021-04-10 2021-07-10 2021-05-16
2:       1 2022-02-17 2022-05-17 2022-03-17
3:       2 2021-06-21 2021-09-21 2021-08-01
4:       2 2021-06-21 2021-09-21 2021-08-03

CodePudding user response:

data.table supports non-equi joins that are the fastest I've found. The syntax is a bit different from the tidyverse but I think it's worth it for these more complicated joins.

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

library(data.table)

dt_a <- setDT(a) # change the data frame to a data table

dt_b <- setDT(b) # change the data frame to a data table

Joined <- dt_b[ # take table b
  dt_a, # join table a
  .(numberb, datex = x.datex, numbera, date1, date2), # selecting these columns
  on = .(numberb == numbera, datex>=date1, datex<=date2), # joining on these columns
  nomatch = NULL # remove non-matches for an inner join
] |> 
  setDF() # change it back to a data frame for comparison

identical(c, Joined) # TRUE
  • Related