Home > other >  R: Joins based on "OR" Statements
R: Joins based on "OR" Statements

Time:12-03

I am working with the R programming language. Suppose I have the following two tables:

table_1 = data.frame(id = c("123", "123", "125", "125"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("123", "123", "125", "125"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I want to perform a "join" (any join will do, I just want to learn how to do this in general) if any of the following two conditions are met (i.e. if Condition_1 = TRUE OR Condition_2 = TRUE, THEN "join")

Condition_1

  1. if table_1$id = table_2$id

AND

  1. if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

Condition_2

  1. if table_1$id2 = table_2$id2

AND

  1. if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

What I tried already: I know how to perform both of these joins individually, for example:

library(sqldf)

#Condition_1


final = sqldf("select a.*, b.*
           
       from table_1 a left join table_2 b
       on a.id = b.id and 
          a.date_1 between 
              b.date_2 and
              b.date_3") 


#Condition_2


final_2 = sqldf("select a.*, b.*
           
       from table_1 a left join table_2 b
       on a.id2 = b.id2 and 
          a.date_1 between 
              b.date_2 and
              b.date_3") 

I can then "bind" these files ("final_3") together (and remove rows which are complete duplicates) :

final_3 = rbind(final, final_2)

final_3 = final_3[!duplicated(final_3[c(1,2,3,4,5,6,7)]),]

My Question: Instead of two separate steps, is there a way to merge both of these tables together in a single step? Can this be done using Base R or DPLYR?

Thanks!

CodePudding user response:

This can be done in a single SQL statement as follows.

library(sqldf)

sqldf("select distinct * 
  from table_1 a left join table_2 b
  on (a.date_1 between b.date_2 and b.date_3) and 
     (a.id = b.id or a.id2 = b.id2)")
  • Related