I am working with the R programming language. Suppose I have the following tables (note: the dates appear as factors in my problem):

table_1 = data.frame(id = c("123", "123", "125", "125"), 
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("123", "121", "125", "126"), 
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I would like to perform (some type of) "join" on these two tables (for now it doesn't matter, e.g. right join, inner join, etc.) using the following conditions:

1) if table_1$id = table_2$id

AND

2) if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

I found a previous question on Stackoverflow which demonstrates how to do this using the "SQLDF" library: r merge by id and date between two dates

library(sqldf)

final = sqldf("select a.*, b.*
           
       from table_1 a left join table_2 b
       on a.id = b.id and 
          a.date_1 between 
              b.date_2 and
              b.date_3")
head(final)
#for some reason, this produces duplicate rows, I don't know why



id     date_1  id     date_2     date_3
1 123 2010-01-31 123 2009-01-31 2011-01-31
2 123 2010-01-31 123 2009-01-31 2011-01-31
3 125 2015-01-31 125 2010-01-31 2020-01-31
4 125 2018-01-31 125 2010-01-31 2020-01-31

#optional: remove duplicates
final_no_dup <- final[!duplicated(final$id),]

My Question: Is there a way to perform the above "join" using Base R? If this is not possible in Base R, can this be done in "dplyr"?

Thanks!

CodePudding user response：

You may try this way in dplyr

table_1 %>%
  left_join(table_2, by = "id") %>%
  mutate(across(2:4, ~as.Date(.x))) %>%
  filter(date_1 <= max(date_3, date_2), date_1 >= min(date_2, date_3)) %>%
  distinct()

   id     date_1     date_2     date_3
1 123 2010-01-31 2009-01-31 2011-01-31
2 125 2015-01-31 2010-01-31 2020-01-31
3 125 2018-01-31 2010-01-31 2020-01-31

Base `R`

table_3 <- merge(x = table_1, y = table_2, by = "id", all.x = TRUE)
table_3 <- table_3[table_3$date_1 <= max(table_3$date_2, table_3$date_3) && table_3$date_1 >= min(table_3$date_2,table_3$date_3)]
table_3[!duplicated(table_3),]