Full Join two datasets but keep both columns I'm matching on and add a new row when its not an-CodePudding

Here is my two datasets:


data1 = data.frame (id =c(1,1,1,1,1,1,1,1,1),
                    drug = c(    "drug1",   "drug1",      "drug2",     "drug3",     "drug4",     "drug4",     "drug5",     "drug6",      "drug7"),
                    date_tx=c("2014-01-21","2015-04-01","2016-03-15","2013-01-13","2014-01-02","2017-04-05","2021-07-22","2022-03-01","2016-01-28"))

data2 = data.frame (id =c(1,1,1,1,1,1,1,1,1,1),
                    drug = c(    "drug1",     "drug1",      "drug2",    "drug3",   "drug4",      "drug4",      "drug5",     "drug6",      "drug7",     "drug8"),
                    date_plan=c("2014-01-23","2015-04-01","2016-03-15","2013-03-01","2014-01-02","2017-04-05","2021-07-24","2022-03-01","2016-01-20","2016-05-05"))

I would like to do a full join using id, drug and the two dates(date_tx and date_plan). Even though I am using Date's to do the join, I would like to retain both columns. Because in cases where the two dates dont match up (i.e. first two dates), I would like to have two rows with the different dates in their respective columns.

What I am hoping to get is the following:

output = data.frame (id =c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                      drug = c(   "drug1",     "drug1", "drug1", "drug2",     "drug3",   "drug3",     "drug4",    "drug4",     "drug5",      "drug5",    "drug6",    "drug7", "drug7", "drug8"),
                      date_tx=c("2014-01-21","2015-04-01",NA, "2016-03-15","2013-01-13",    NA,   "2014-01-02","2017-04-05","2021-07-22",      NA,      "2022-03-01","2016-01-28",NA,NA),
                      date_plan=c(NA,"2015-04-01","2014-01-23","2016-03-15",     NA,  "2013-03-01","2014-01-02","2017-04-05",    NA,       "2021-07-24", "2022-03-01", NA, "2016-01-20","2016-05-05"))

What I have tried. The below does give me the number of rows, but for the ones that dont match up, I need to be able to distinguish what date_column its coming from.

merge <- full_join(data1, data2, by=c("id"="id", "drug"="drug", "date_tx"="date_plan"))

any help would be appreciated!!

CodePudding user response：

Does creating a copy of the date column in each dataframe for merging called date_merge get you the result you need?

data1 %>% 
  mutate(date_merge = date_tx) %>% 
  full_join(data2 %>% 
              mutate(date_merge = date_plan), 
            by=c("id", "drug", "date_merge")) %>% 
  select(-date_merge) %>% 
  arrange(id, drug)

CodePudding user response：

You could try using powerjoin package by @moodymudskipper. You can do a full join and indicate keep = "both" to retain your two columns of interest. The conflict parameter with coalesce will resolve the identical column names from the 2 data.frames. I added arrange and select at the end so the final result would be the same as output in the post.

library(powerjoin)

power_full_join(
  data1,
  data2,
  by = c("id", "drug", "date_tx" = "date_plan"),
  keep = "both",
  conflict = coalesce_xy
) %>%
  arrange(id, drug) %>%
  select(id, drug, date_tx, date_plan)

Output

   id  drug    date_tx  date_plan
1   1 drug1 2014-01-21       <NA>
2   1 drug1 2015-04-01 2015-04-01
3   1 drug1       <NA> 2014-01-23
4   1 drug2 2016-03-15 2016-03-15
5   1 drug3 2013-01-13       <NA>
6   1 drug3       <NA> 2013-03-01
7   1 drug4 2014-01-02 2014-01-02
8   1 drug4 2017-04-05 2017-04-05
9   1 drug5 2021-07-22       <NA>
10  1 drug5       <NA> 2021-07-24
11  1 drug6 2022-03-01 2022-03-01
12  1 drug7 2016-01-28       <NA>
13  1 drug7       <NA> 2016-01-20
14  1 drug8       <NA> 2016-05-05