R: Combining Multiple Fuzzy Joins-CodePudding

Using the R programming language, I have the following two tables (in my actual problem, all dates are given to me in "factor" types):

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:

Condition_1

if table_1$id "fuzzy equal" table_2$id AND
if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

Condition_2

if table_1$id2 "fuzzy equal" table_2$id2

Right now, I know how to do this in 2 part:

library(dplyr)
library(fuzzyjoin)

part_1 =  stringdist_inner_join(table_1, table_2, by = "id1", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) 

part_2 =  stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) 

combine = rbind(part_1, part_2)

final = combine[!duplicated(combine[c(1,2,3,4,5,6,7)]),]

My Question

Is there a "better" way to run this join together, instead of in two separate parts?
It seems that the SQL query in "part_1" is first performing a fuzzy join on all records, and then only keeping the relevant records that satisfy the date criteria, i.e. filter(date_1 >= date_2, date_1 <= date_3) . This seems to be an inefficient way of doing things - or is this the only way possible to accomplish this task, since a fuzzy join must be run on all rows by default to see if "id" condition is satisfied, and only then the "date" condition can be satisfied?

Thanks!

CodePudding user response：

If we want to do this in a loop, loop over the variable part i.e. the by

library(purrr)
library(fuzzyjoin)
library(dplyr)
final2 <- map_dfr(c("id1", "id2"),  ~
       stringdist_inner_join(table_1, table_2, by = .x, max_dist = 2)) %>%
       distinct %>%
       arrange(across(everything()))

-checking

> all.equal(final %>% 
         arrange(across(everything())), final2)
[1] TRUE