R: Adding a column to a data frame based on a column in another data frame causes several duplicate-CodePudding

I want to take the state name from one data frame, and put it in another based on the county fips code. Below is a snippet of the two

> head(arr)
  state_abb fips_state_county_code
1        CO                   8001
2        CO                   8001
3        CO                   8001
4        CO                   8001
5        CO                   8001
6        CO                   8001

> head(countyname)
    county       countyname
245   8001     Adams County
246   8003   Alamosa County
247   8005  Arapahoe County
248   8007 Archuleta County
249   8009      Baca County
250   8011      Bent County

When I use left_join, I get a ton of duplicates

xcountyname= left_join(arr, countyname, by =c("county" = "fips_state_county_code")) 

> head(xcountyname)
  county   countyname state_abb
1   8001 Adams County        CO
2   8001 Adams County        CO
3   8001 Adams County        CO
4   8001 Adams County        CO
5   8001 Adams County        CO
6   8001 Adams County        CO

It should only be adding the state column, but it has now gone from 535 rows to over 71k rows. Is there a better way to do this so that the state only fills in the existing rows of the xcountyname data frame?

Edit: arr is a subset of another dataset, that's why there are duplicates. But it still stands that I only want the rows with the state name to transfer with the existing fips codes in xcountyname

CodePudding user response：

Update:

One state abb:

xcountyname= left_join(arr, countyname, by =c("fips_state_county_code"="county")) %>% 
  distinct()

  state_abb fips_state_county_code   countyname
1        CO                   8001 Adams County

All states abb:

xcountyname= full_join(arr, countyname, by =c("fips_state_county_code"="county")) %>% 
  distinct()

  state_abb fips_state_county_code       countyname
1        CO                   8001     Adams County
2      <NA>                   8003   Alamosa County
3      <NA>                   8005  Arapahoe County
4      <NA>                   8007 Archuleta County
5      <NA>                   8009      Baca County
6      <NA>                   8011      Bent County

First answer: Do a full_join:

If you want to remove the duplicates, just add distinct() in the final line:

library(dplyr)

xcountyname= full_join(arr, countyname, by =c("fips_state_county_code"="county"))

   state_abb fips_state_county_code       countyname
1         CO                   8001     Adams County
2         CO                   8001     Adams County
3         CO                   8001     Adams County
4         CO                   8001     Adams County
5         CO                   8001     Adams County
6         CO                   8001     Adams County
7       <NA>                   8003   Alamosa County
8       <NA>                   8005  Arapahoe County
9       <NA>                   8007 Archuleta County
10      <NA>                   8009      Baca County
11      <NA>                   8011      Bent County

CodePudding user response：

You only want one state abb for each fips, so use distinct()

left_join(
  distinct(arr %>% select(state_abb, county=fips_state_county_code)),
  countyname
)