Currently when joining two datasets (of different years) I get duplicates of the second one when there are less observations in the second one than the first.
Below, ID 1 only has 1 observation in year y, but it gets repeated because the first dataset of year x has three observations in total. I don't want the duplicates, but simply NAs.
So what I currently get is this:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 A 2
3 1 C 1 A 2
What I want is:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 NA NA
3 1 C 1 NA NA
The end result is that my manager can tell in year x customer 1 ordered A, B, C in n.x quantities. In year y they only ordered A in n.y quantities.
Data:
structure(list(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6,
7, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
structure(list(ID = 1, Value = "A", N = 2), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L))
CodePudding user response:
I would do it like this:
merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)