I am working with the R programming language. Suppose I have the following two tables:
table_1 = data.frame(id = c("123", "123", "125", "125"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id = as.factor(table_1$id)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id = c("123", "123", "125", "125"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id = as.factor(table_2$id)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
I want to perform a "join" (any join will do, I just want to learn how to do this in general) if any of the following two conditions are met (i.e. if Condition_1 = TRUE OR Condition_2 = TRUE, THEN "join")
Condition_1
- if table_1$id = table_2$id
AND
- if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
Condition_2
- if table_1$id2 = table_2$id2
AND
- if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
What I tried already: I know how to perform both of these joins individually, for example:
library(sqldf)
#Condition_1
final = sqldf("select a.*, b.*
from table_1 a left join table_2 b
on a.id = b.id and
a.date_1 between
b.date_2 and
b.date_3")
#Condition_2
final_2 = sqldf("select a.*, b.*
from table_1 a left join table_2 b
on a.id2 = b.id2 and
a.date_1 between
b.date_2 and
b.date_3")
I can then "bind" these files ("final_3") together (and remove rows which are complete duplicates) :
final_3 = rbind(final, final_2)
final_3 = final_3[!duplicated(final_3[c(1,2,3,4,5,6,7)]),]
My Question: Instead of two separate steps, is there a way to merge both of these tables together in a single step? Can this be done using Base R or DPLYR?
Thanks!
CodePudding user response:
This can be done in a single SQL statement as follows.
library(sqldf)
sqldf("select distinct *
from table_1 a left join table_2 b
on (a.date_1 between b.date_2 and b.date_3) and
(a.id = b.id or a.id2 = b.id2)")