Home > Blockchain >  Issue with full_join producing duplicates
Issue with full_join producing duplicates

Time:09-23

I am looking to full_join two datasets by an ID number, yet I am having the following issues:

Here is a snapshot of my data

ID= c(112,112,111,110)
code_2020= c(23,22,23,20)
df_2020 <- data.frame(ID,code_2020)

ID= c(112,111,110,109)
code_2021= c(23,23,23,23)
df_2021 <- data.frame(ID,code_2021)

I want to full_join both dataset but because in 2020 there is a person with ID as 112 appearing twice (once with code 23 and once with code 22), I get a repetition in the code_2021 with 112 and 23 appearing twice rather than once for 2021.

Here is what I get with my full join df_2020_2021<- full_join(df_2020,df_2021,by="ID")

Output I get:

ID code_2020 code_2021
112 23 23
112 22 23
111 23 23
110 20 23
109 NA 23

Output I want:

ID code_2020 code_2021
112 23 23
112 22 NA
111 23 23
110 20 23
109 NA 23

the difference is in the NA in row 2 (code_2021 of ID=112).

Edit: in terms of which 112 should be matched, I would be happy for whichever 112 to be matched (so the first 112 to be matched would work).

CodePudding user response:

Number the duplicates and include that in the join:

df_2020 %>%
  group_by(ID) %>%
  mutate(occurrence = row_number()) %>%
  ungroup() %>%
  full_join(mutate(df_2021, occurrence = 1)) %>%
  select(-occurrence)
# Joining, by = c("ID", "occurrence")
# # A tibble: 5 × 3
#      ID code_2020 code_2021
#   <dbl>     <dbl>     <dbl>
# 1   112        23        23
# 2   112        22        NA
# 3   111        23        23
# 4   110        20        23
# 5   109        NA        23
  • Related