I have state and county names in a single string variable states_county, and I want to split them to create two distinct variables - states and county.
Some states have two (or even three words), and the same goes for counties.
> states_county_names_df$states_county
[1] "California San Francisco"
[2] "New York Bronx"
[3] "New York Kings"
[4] "New York New York"
[5] "New York Queens"
[6] "New York Richmond"
[7] "Washington King"
[8] "Illinois Cook"
[9] "Massachusetts Suffolk"
[10] "District of Columbia District of Columbia"
Thus, I want to pull in names of all US states to create a factor variable and use a substring command to remove the states (and get counties), while also only keeping the states (to remove the counties). I am not sure how to loop each factor variable level for each string entry in my states_county variable though.
state_name<-state.name
for(i in levels(state_name)){
print("levels(state_name)")
test<-str_remove_all(states_county_names_df$states_county, "levels(state_name)")
}
Would appreciate if anyone could suggest different ways to get the result.
I currently use brute force and write a 50 lines of code to generate a state column, but still cannot get to counties as I am not sure how to "substract" the state column from states_county column.
states_county_names_df$states[grepl("California",states_county_names_df$states_county)] <- "California"
states_county_names_df$states[grepl("New York", states_county_names_df$states_county)] <- "New York"
.
.
.
CodePudding user response:
R has a built-in dataset called "state" which includes a list of US states. The list doesn't have "District of Columbia" in it but you can add that to the list, and then use the list to split your "states_county" into states and counties e.g.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"New York Kings",
"New York New York",
"New York Queens",
"New York Richmond",
"Washington King",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
state.name
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
#> states_county state
#> 1 California San Francisco California
#> 2 New York Bronx New York
#> 3 New York Kings New York
#> 4 New York New York New York
#> 5 New York Queens New York
#> 6 New York Richmond New York
#> 7 Washington King Washington
#> 8 Illinois Cook Illinois
#> 9 Massachusetts Suffolk Massachusetts
#> 10 District of Columbia District of Columbia District of Columbia
#> county
#> 1 San Francisco
#> 2 Bronx
#> 3 Kings
#> 4 New York
#> 5 Queens
#> 6 Richmond
#> 7 King
#> 8 Cook
#> 9 Suffolk
#> 10 District of Columbia
Created on 2022-03-16 by the reprex package (v2.0.1)
And a base R option:
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"New York Kings",
"New York New York",
"New York Queens",
"New York Richmond",
"Washington King",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
state.name
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df$state <- sapply(regmatches(x = states_county_names_df$states_county,
regexec(paste(states_inc_Columbia, collapse = "|"),
states_county_names_df$states_county)),
"[", 1)
states_county_names_df$county <- sub(x = states_county_names_df$states_county,
pattern = paste(states_inc_Columbia, collapse = "|"),
replacement = "", ignore.case = TRUE)
states_county_names_df
#> states_county state
#> 1 California San Francisco California
#> 2 New York Bronx New York
#> 3 New York Kings New York
#> 4 New York New York New York
#> 5 New York Queens New York
#> 6 New York Richmond New York
#> 7 Washington King Washington
#> 8 Illinois Cook Illinois
#> 9 Massachusetts Suffolk Massachusetts
#> 10 District of Columbia District of Columbia District of Columbia
#> county
#> 1 San Francisco
#> 2 Bronx
#> 3 Kings
#> 4 New York
#> 5 Queens
#> 6 Richmond
#> 7 King
#> 8 Cook
#> 9 Suffolk
#> 10 District of Columbia
Created on 2022-03-16 by the reprex package (v2.0.1)
CodePudding user response:
In Base R:
pat <- sprintf('\\b(%s)\\b', paste(c(state.name, 'District of Columbia'), collapse = '|'))
read.csv(text=sub(pat, '\\1,', df$states_couty, TRUE), col.names = c('State', 'County'))
State County
1 New York Bronx
2 New York Kings
3 New York New York
4 New York Queens
5 New York Richmond
6 Washington King
7 Illinois Cook
8 Massachusetts Suffolk
9 District of Columbia District of Columbia
CodePudding user response:
Similar to dear @jared_mamrot answer:
library(tidyverse)
pattern <- paste(c(state.name, "District of Columbia"), collapse = "|")
states_county_names_df %>%
mutate(county = str_trim(str_replace(states_county, pattern, "")),
states = str_trim(str_remove(states_county, paste(county, collapse = "|"))
), .keep="unused")
county states
<chr> <chr>
1 San Francisco California
2 Bronx Bronx
3 Kings Kings
4 New York New York
5 Queens Queens
6 Richmond Richmond
7 King Washington
8 Cook Illinois
9 Suffolk Massachusetts
10 District of Columbia District of Columbia