I am looking to full_join two datasets by an ID number, yet I am having the following issues:
Here is a snapshot of my data
ID= c(112,112,111,110)
code_2020= c(23,22,23,20)
df_2020 <- data.frame(ID,code_2020)
ID= c(112,111,110,109)
code_2021= c(23,23,23,23)
df_2021 <- data.frame(ID,code_2021)
I want to full_join both dataset but because in 2020 there is a person with ID as 112 appearing twice (once with code 23 and once with code 22), I get a repetition in the code_2021 with 112 and 23 appearing twice rather than once for 2021.
Here is what I get with my full join df_2020_2021<- full_join(df_2020,df_2021,by="ID")
Output I get:
ID | code_2020 | code_2021 |
---|---|---|
112 | 23 | 23 |
112 | 22 | 23 |
111 | 23 | 23 |
110 | 20 | 23 |
109 | NA | 23 |
Output I want:
ID | code_2020 | code_2021 |
---|---|---|
112 | 23 | 23 |
112 | 22 | NA |
111 | 23 | 23 |
110 | 20 | 23 |
109 | NA | 23 |
the difference is in the NA in row 2 (code_2021 of ID=112).
Edit: in terms of which 112 should be matched, I would be happy for whichever 112 to be matched (so the first 112 to be matched would work).
CodePudding user response:
Number the duplicates and include that in the join:
df_2020 %>%
group_by(ID) %>%
mutate(occurrence = row_number()) %>%
ungroup() %>%
full_join(mutate(df_2021, occurrence = 1)) %>%
select(-occurrence)
# Joining, by = c("ID", "occurrence")
# # A tibble: 5 × 3
# ID code_2020 code_2021
# <dbl> <dbl> <dbl>
# 1 112 23 23
# 2 112 22 NA
# 3 111 23 23
# 4 110 20 23
# 5 109 NA 23