I am working with the R programming language. Suppose I have the following tables (note: the dates appear as factors in my problem):
table_1 = data.frame(id = c("123", "123", "125", "125"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id = as.factor(table_1$id)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id = c("123", "121", "125", "126"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id = as.factor(table_2$id)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
I would like to perform (some type of) "join" on these two tables (for now it doesn't matter, e.g. right join, inner join, etc.) using the following conditions:
1) if table_1$id = table_2$id
AND
2) if table_1$date BETWEEN(table_2$date_2,table_2$date_3)
I found a previous question on Stackoverflow which demonstrates how to do this using the "SQLDF" library: r merge by id and date between two dates
library(sqldf)
final = sqldf("select a.*, b.*
from table_1 a left join table_2 b
on a.id = b.id and
a.date_1 between
b.date_2 and
b.date_3")
head(final)
#for some reason, this produces duplicate rows, I don't know why
id date_1 id date_2 date_3
1 123 2010-01-31 123 2009-01-31 2011-01-31
2 123 2010-01-31 123 2009-01-31 2011-01-31
3 125 2015-01-31 125 2010-01-31 2020-01-31
4 125 2018-01-31 125 2010-01-31 2020-01-31
#optional: remove duplicates
final_no_dup <- final[!duplicated(final$id),]
My Question: Is there a way to perform the above "join" using Base R? If this is not possible in Base R, can this be done in "dplyr"?
Thanks!
CodePudding user response:
You may try this way in dplyr
table_1 %>%
left_join(table_2, by = "id") %>%
mutate(across(2:4, ~as.Date(.x))) %>%
filter(date_1 <= max(date_3, date_2), date_1 >= min(date_2, date_3)) %>%
distinct()
id date_1 date_2 date_3
1 123 2010-01-31 2009-01-31 2011-01-31
2 125 2015-01-31 2010-01-31 2020-01-31
3 125 2018-01-31 2010-01-31 2020-01-31
Base R
table_3 <- merge(x = table_1, y = table_2, by = "id", all.x = TRUE)
table_3 <- table_3[table_3$date_1 <= max(table_3$date_2, table_3$date_3) && table_3$date_1 >= min(table_3$date_2,table_3$date_3)]
table_3[!duplicated(table_3),]