R: outer-merge two dataframes with unequal columns-CodePudding

I'm new to coding and am struggling a bit with this merge. I have two dataframes:

> a1
  a b       c  
1 1 apple   x
2 2 bees    a
3 3 candy   a
4 4 dice    s
5 5 donut   d

> a2
  a b      c  d
1 1 apple  x  a
2 2 bees   y  d
3 6 coffee r  s

I would like to join these two dataframes by a, b, and c. I want to get rid of the duplicate rows where a, b, c are the same, but keep the unique rows in both datasets. In the case where a unique row in a2 is kept, I would also want d to be shown. So the result would be something like the following:

> a3
  a b      c  d
1 1 apple  x  a
2 2 bees   y  d
3 3 candy  a  NA
4 4 dice   s  NA
5 5 donut  d  NA
6 6 coffee r  s

CodePudding user response：

Hello there & here you go:

a3 <- merge(a1, a2, all=TRUE) # this merges your data preserving all vals from all cols/rows (if missing - adds NA)
a3 <- a3[order(a3[,"d"], decreasing=TRUE),] # this sorts your merged df
a3 <- a3[!duplicated(a3[,"b"]),] # this removes duplicate values

I edited my answer since I looked more carefully at your desired output. You wanna preserve "d" col value even tho there is a duplicate, so I first suggest to order your df based on the "d" column in decreasing fashion (so the NAs will be at the bottom of the df). Then I suggest to exclude the duplicates in col "b", and since the function preserves the first encountered value, the "apple" row taken initially from a2 df will be preserved in the output. Also, you can make an oneliner using pipe operator.

CodePudding user response：

You can use a full_join from tidyverse:

library(tidyverse)

full_join(a1, a2, by = c("a", "b", "c")) %>% 
  distinct()

Output

  a      b c    d
1 1  apple x    a
2 2   bees a <NA>
3 3  candy a <NA>
4 4   dice s <NA>
5 5  donut d <NA>
6 2   bees y    d
7 6 coffee r    s

Data

a1 <- structure(list(a = 1:5, b = c("apple", "bees", "candy", "dice", 
"donut"), c = c("x", "a", "a", "s", "d")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

a2 <- structure(list(a = c(1L, 2L, 6L), b = c("apple", "bees", "coffee"
), c = c("x", "y", "r"), d = c("a", "d", "s")), class = "data.frame", row.names = c("1", 
"2", "3"))