Using the R programming language, I have the following two tables (in my actual problem, all dates are given to me in "factor" types):
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:
Condition_1
if table_1$id "fuzzy equal" table_2$id AND
if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
Condition_2
- if table_1$id2 "fuzzy equal" table_2$id2
Right now, I know how to do this in 2 part:
library(dplyr)
library(fuzzyjoin)
part_1 = stringdist_inner_join(table_1, table_2, by = "id1", max_dist = 2) %>%
filter(date_1 >= date_2, date_1 <= date_3)
part_2 = stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2)
combine = rbind(part_1, part_2)
final = combine[!duplicated(combine[c(1,2,3,4,5,6,7)]),]
My Question
Is there a "better" way to run this join together, instead of in two separate parts?
It seems that the SQL query in "part_1" is first performing a fuzzy join on all records, and then only keeping the relevant records that satisfy the date criteria, i.e.
filter(date_1 >= date_2, date_1 <= date_3)
. This seems to be an inefficient way of doing things - or is this the only way possible to accomplish this task, since a fuzzy join must be run on all rows by default to see if "id" condition is satisfied, and only then the "date" condition can be satisfied?
Thanks!
CodePudding user response:
If we want to do this in a loop, loop over the variable part i.e. the by
library(purrr)
library(fuzzyjoin)
library(dplyr)
final2 <- map_dfr(c("id1", "id2"), ~
stringdist_inner_join(table_1, table_2, by = .x, max_dist = 2)) %>%
distinct %>%
arrange(across(everything()))
-checking
> all.equal(final %>%
arrange(across(everything())), final2)
[1] TRUE