Here is my two datasets:
data1 = data.frame (id =c(1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug2", "drug3", "drug4", "drug4", "drug5", "drug6", "drug7"),
date_tx=c("2014-01-21","2015-04-01","2016-03-15","2013-01-13","2014-01-02","2017-04-05","2021-07-22","2022-03-01","2016-01-28"))
data2 = data.frame (id =c(1,1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug2", "drug3", "drug4", "drug4", "drug5", "drug6", "drug7", "drug8"),
date_plan=c("2014-01-23","2015-04-01","2016-03-15","2013-03-01","2014-01-02","2017-04-05","2021-07-24","2022-03-01","2016-01-20","2016-05-05"))
I would like to do a full join using id, drug and the two dates(date_tx and date_plan). Even though I am using Date's to do the join, I would like to retain both columns. Because in cases where the two dates dont match up (i.e. first two dates), I would like to have two rows with the different dates in their respective columns.
What I am hoping to get is the following:
output = data.frame (id =c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
drug = c( "drug1", "drug1", "drug1", "drug2", "drug3", "drug3", "drug4", "drug4", "drug5", "drug5", "drug6", "drug7", "drug7", "drug8"),
date_tx=c("2014-01-21","2015-04-01",NA, "2016-03-15","2013-01-13", NA, "2014-01-02","2017-04-05","2021-07-22", NA, "2022-03-01","2016-01-28",NA,NA),
date_plan=c(NA,"2015-04-01","2014-01-23","2016-03-15", NA, "2013-03-01","2014-01-02","2017-04-05", NA, "2021-07-24", "2022-03-01", NA, "2016-01-20","2016-05-05"))
What I have tried. The below does give me the number of rows, but for the ones that dont match up, I need to be able to distinguish what date_column its coming from.
merge <- full_join(data1, data2, by=c("id"="id", "drug"="drug", "date_tx"="date_plan"))
any help would be appreciated!!
CodePudding user response:
Does creating a copy of the date column in each dataframe for merging called date_merge
get you the result you need?
data1 %>%
mutate(date_merge = date_tx) %>%
full_join(data2 %>%
mutate(date_merge = date_plan),
by=c("id", "drug", "date_merge")) %>%
select(-date_merge) %>%
arrange(id, drug)
CodePudding user response:
You could try using powerjoin
package by @moodymudskipper. You can do a full join and indicate keep = "both"
to retain your two columns of interest. The conflict
parameter with coalesce will resolve the identical column names from the 2 data.frames. I added arrange
and select
at the end so the final result would be the same as output
in the post.
library(powerjoin)
power_full_join(
data1,
data2,
by = c("id", "drug", "date_tx" = "date_plan"),
keep = "both",
conflict = coalesce_xy
) %>%
arrange(id, drug) %>%
select(id, drug, date_tx, date_plan)
Output
id drug date_tx date_plan
1 1 drug1 2014-01-21 <NA>
2 1 drug1 2015-04-01 2015-04-01
3 1 drug1 <NA> 2014-01-23
4 1 drug2 2016-03-15 2016-03-15
5 1 drug3 2013-01-13 <NA>
6 1 drug3 <NA> 2013-03-01
7 1 drug4 2014-01-02 2014-01-02
8 1 drug4 2017-04-05 2017-04-05
9 1 drug5 2021-07-22 <NA>
10 1 drug5 <NA> 2021-07-24
11 1 drug6 2022-03-01 2022-03-01
12 1 drug7 2016-01-28 <NA>
13 1 drug7 <NA> 2016-01-20
14 1 drug8 <NA> 2016-05-05