R Join without duplicates

Time:09-13

Currently, when I join two datasets (from different years), I get duplicated rows from the second dataset whenever it has fewer observations than the first.

Below, ID 1 has only one observation in year y, but it gets repeated because the dataset for year x has three observations for that ID. I don't want the duplicates, just NAs.

So what I currently get is this:

     ID Value.x   N.x Value.y   N.y
  <dbl> <chr>   <dbl> <chr>   <dbl>
1     1 A           6 A           2
2     1 B           7 A           2
3     1 C           1 A           2
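
The join call that produced this output is not shown. A result of this shape would come from, for example, joining on ID alone, so that the single year-y row matches every year-x row with the same ID. A minimal sketch of that assumption, using base R merge() and the placeholder names df_x and df_y:

```r
# Data from the question, rebuilt as plain data frames.
# df_x and df_y are placeholder names, not taken from the original post.
df_x <- data.frame(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6, 7, 1),
                   stringsAsFactors = FALSE)
df_y <- data.frame(ID = 1, Value = "A", N = 2, stringsAsFactors = FALSE)

# Assumed join: keyed on ID only. Every year-x row with ID 1 matches the
# one year-y row, so that row is repeated three times.
dup <- merge(df_x, df_y, by = "ID")
print(dup)
```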

What I want is:

     ID Value.x   N.x Value.y   N.y
  <dbl> <chr>   <dbl> <chr>   <dbl>
1     1 A           6 A           2
2     1 B           7 NA         NA
3     1 C           1 NA         NA

The end result should let my manager see that in year x customer 1 ordered A, B, and C in N.x quantities, while in year y they ordered only A, in N.y quantities.

Data:

# Year x
tbl_df1 <- structure(list(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6, 
7, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

# Year y
tbl_df2 <- structure(list(ID = 1, Value = "A", N = 2), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -1L))

CodePudding user response:

I would do it like this:

merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)
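
As a sketch of how this plays out on the question's data: merging on both ID and Value produces the NAs, but the key column Value appears only once in the result, so there is no separate Value.y column. If the exact Value.x / Value.y layout is wanted, one option (an assumption, not part of the answer above) is to copy the key into its own column before merging. The names df_x and df_y are placeholders, and all.x = TRUE is used here instead of all = TRUE so that only year x's rows are kept.

```r
# Data from the question, rebuilt as plain data frames.
# df_x and df_y are placeholder names for the year-x and year-y tables.
df_x <- data.frame(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6, 7, 1),
                   stringsAsFactors = FALSE)
df_y <- data.frame(ID = 1, Value = "A", N = 2, stringsAsFactors = FALSE)

# Copy the join key so each year's Value survives as its own column.
df_x$Value.x <- df_x$Value
df_y$Value.y <- df_y$Value

# Left join on ID and Value: unmatched year-x rows get NA in the year-y columns.
# Only the shared non-key column N receives the .x / .y suffixes.
out <- merge(df_x, df_y, by = c("ID", "Value"), all.x = TRUE)

# Keep the columns in the layout shown in the question.
out <- out[, c("ID", "Value.x", "N.x", "Value.y", "N.y")]
print(out)
```

With all = TRUE, as in the answer, rows that exist only in year y would also be kept, with NA in the year-x columns.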