Extracting states and counties from state-county character variable-CodePudding

I have state and county names in a single string variable states_county, and I want to split them to create two distinct variables - states and county.
Some states have two (or even three words), and the same goes for counties.

> states_county_names_df$states_county
  [1] "California San Francisco"                 
  [2] "New York Bronx"                           
  [3] "New York Kings"                           
  [4] "New York New York"                        
  [5] "New York Queens"                          
  [6] "New York Richmond"                        
  [7] "Washington King"                          
  [8] "Illinois Cook"                            
  [9] "Massachusetts Suffolk"                    
 [10] "District of Columbia District of Columbia"

Thus, I want to pull in names of all US states to create a factor variable and use a substring command to remove the states (and get counties), while also only keeping the states (to remove the counties). I am not sure how to loop each factor variable level for each string entry in my states_county variable though.

state_name<-state.name 

for(i in levels(state_name)){
  print("levels(state_name)")
  test<-str_remove_all(states_county_names_df$states_county, "levels(state_name)")
   }

Would appreciate if anyone could suggest different ways to get the result.

I currently use brute force and write a 50 lines of code to generate a state column, but still cannot get to counties as I am not sure how to "substract" the state column from states_county column.

states_county_names_df$states[grepl("California",states_county_names_df$states_county)] <- "California"
states_county_names_df$states[grepl("New York", states_county_names_df$states_county)] <- "New York"
.
.
.

CodePudding user response：

R has a built-in dataset called "state" which includes a list of US states. The list doesn't have "District of Columbia" in it but you can add that to the list, and then use the list to split your "states_county" into states and counties e.g.

library(tidyverse)

states_county_names_df <- data.frame(states_county = c(
  "California San Francisco",                
  "New York Bronx",                           
  "New York Kings",                           
  "New York New York",                        
  "New York Queens",                          
  "New York Richmond",                        
  "Washington King",                          
  "Illinois Cook",                            
  "Massachusetts Suffolk",                    
  "District of Columbia District of Columbia"
  )
)

data(state)

state.name
#>  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
#>  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
#>  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
#> [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
#> [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
#> [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
#> [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
#> [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
#> [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
#> [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
#> [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
#> [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
#> [49] "Wisconsin"      "Wyoming"
states_inc_Columbia <- c(state.name, "District of Columbia")

states_county_names_df %>%
  mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
         county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
#>                                states_county                state
#> 1                   California San Francisco           California
#> 2                             New York Bronx             New York
#> 3                             New York Kings             New York
#> 4                          New York New York             New York
#> 5                            New York Queens             New York
#> 6                          New York Richmond             New York
#> 7                            Washington King           Washington
#> 8                              Illinois Cook             Illinois
#> 9                      Massachusetts Suffolk        Massachusetts
#> 10 District of Columbia District of Columbia District of Columbia
#>                   county
#> 1          San Francisco
#> 2                  Bronx
#> 3                  Kings
#> 4               New York
#> 5                 Queens
#> 6               Richmond
#> 7                   King
#> 8                   Cook
#> 9                Suffolk
#> 10  District of Columbia

^{Created on 2022-03-16 by the reprex package (v2.0.1)}

And a base R option:

states_county_names_df <- data.frame(states_county = c(
  "California San Francisco",                
  "New York Bronx",                           
  "New York Kings",                           
  "New York New York",                        
  "New York Queens",                          
  "New York Richmond",                        
  "Washington King",                          
  "Illinois Cook",                            
  "Massachusetts Suffolk",                    
  "District of Columbia District of Columbia"
  )
)

data(state)

state.name
#>  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
#>  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
#>  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
#> [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
#> [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
#> [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
#> [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
#> [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
#> [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
#> [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
#> [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
#> [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
#> [49] "Wisconsin"      "Wyoming"
states_inc_Columbia <- c(state.name, "District of Columbia")

states_county_names_df$state <- sapply(regmatches(x = states_county_names_df$states_county,
                                                  regexec(paste(states_inc_Columbia, collapse = "|"),
                                                          states_county_names_df$states_county)),
                                       "[", 1)

states_county_names_df$county <- sub(x = states_county_names_df$states_county,
                                     pattern = paste(states_inc_Columbia, collapse = "|"),
                                     replacement = "", ignore.case = TRUE)


states_county_names_df
#>                                states_county                state
#> 1                   California San Francisco           California
#> 2                             New York Bronx             New York
#> 3                             New York Kings             New York
#> 4                          New York New York             New York
#> 5                            New York Queens             New York
#> 6                          New York Richmond             New York
#> 7                            Washington King           Washington
#> 8                              Illinois Cook             Illinois
#> 9                      Massachusetts Suffolk        Massachusetts
#> 10 District of Columbia District of Columbia District of Columbia
#>                   county
#> 1          San Francisco
#> 2                  Bronx
#> 3                  Kings
#> 4               New York
#> 5                 Queens
#> 6               Richmond
#> 7                   King
#> 8                   Cook
#> 9                Suffolk
#> 10  District of Columbia

^{Created on 2022-03-16 by the reprex package (v2.0.1)}

CodePudding user response：

In Base R:

pat <- sprintf('\\b(%s)\\b', paste(c(state.name, 'District of Columbia'), collapse = '|'))
read.csv(text=sub(pat, '\\1,', df$states_couty, TRUE), col.names = c('State', 'County'))

                 State                County
1             New York                 Bronx
2             New York                 Kings
3             New York              New York
4             New York                Queens
5             New York              Richmond
6           Washington                  King
7             Illinois                  Cook
8        Massachusetts               Suffolk
9 District of Columbia  District of Columbia

CodePudding user response：

Similar to dear @jared_mamrot answer:

library(tidyverse)

pattern <- paste(c(state.name, "District of Columbia"), collapse = "|")

states_county_names_df %>% 
  mutate(county = str_trim(str_replace(states_county, pattern, "")),
         states = str_trim(str_remove(states_county, paste(county, collapse = "|"))
         ), .keep="unused")

   county               states              
   <chr>                <chr>               
 1 San Francisco        California          
 2 Bronx                Bronx               
 3 Kings                Kings               
 4 New York             New York            
 5 Queens               Queens              
 6 Richmond             Richmond            
 7 King                 Washington          
 8 Cook                 Illinois            
 9 Suffolk              Massachusetts       
10 District of Columbia District of Columbia