Home > Blockchain >  R: Merge "Between" using Base R
R: Merge "Between" using Base R

Time:12-02

I am working with the R programming language. Suppose I have the following tables (note: the dates appear as factors in my problem):

table_1 = data.frame(id = c("123", "123", "125", "125"), 
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("123", "121", "125", "126"), 
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I would like to perform (some type of) "join" on these two tables (for now it doesn't matter, e.g. right join, inner join, etc.) using the following conditions:

1) if table_1$id = table_2$id

AND

2) if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

I found a previous question on Stackoverflow which demonstrates how to do this using the "SQLDF" library: r merge by id and date between two dates

library(sqldf)

final = sqldf("select a.*, b.*
           
       from table_1 a left join table_2 b
       on a.id = b.id and 
          a.date_1 between 
              b.date_2 and
              b.date_3")
head(final)
#for some reason, this produces duplicate rows, I don't know why



id     date_1  id     date_2     date_3
1 123 2010-01-31 123 2009-01-31 2011-01-31
2 123 2010-01-31 123 2009-01-31 2011-01-31
3 125 2015-01-31 125 2010-01-31 2020-01-31
4 125 2018-01-31 125 2010-01-31 2020-01-31

#optional: remove duplicates
final_no_dup <- final[!duplicated(final$id),]

My Question: Is there a way to perform the above "join" using Base R? If this is not possible in Base R, can this be done in "dplyr"?

Thanks!

CodePudding user response:

You may try this way in dplyr

table_1 %>%
  left_join(table_2, by = "id") %>%
  mutate(across(2:4, ~as.Date(.x))) %>%
  filter(date_1 <= max(date_3, date_2), date_1 >= min(date_2, date_3)) %>%
  distinct()

   id     date_1     date_2     date_3
1 123 2010-01-31 2009-01-31 2011-01-31
2 125 2015-01-31 2010-01-31 2020-01-31
3 125 2018-01-31 2010-01-31 2020-01-31

Base R

table_3 <- merge(x = table_1, y = table_2, by = "id", all.x = TRUE)
table_3 <- table_3[table_3$date_1 <= max(table_3$date_2, table_3$date_3) && table_3$date_1 >= min(table_3$date_2,table_3$date_3)]
table_3[!duplicated(table_3),]
  • Related