I have a variable below that I believe is separated by space.
[95] "Florida Volusia Deltona-Daytona Beach-Ormond Beach"
[96] "Iowa Polk Des Moines"
[97] "Mississippi Hinds Jackson"
[98] "Georgia Richmond Augusta-Richmond"
[99] "Ohio Mahoning Youngstown-Warren-Boardman"
[100] " Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[101] " Pennsylvania Dauphin Harrisburg-Carlisle"
[102] " Florida Brevard Palm Bay-Melbourne-Titusville"
[103] " Utah Utah Provo-Orem"
[104] " Tennessee Hamilton Chattanooga"
[105] " North Carolina Durham Durham"
I want to create three variables out of this string - state, county, and MSA. But the usual string split commands are not working. I tried stringi command too but fail to split the variable. Not sure why this is happening, as the command works on simpler strings.
> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df, :
non-character argument
> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
list()
> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()
> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()
CodePudding user response:
Here's an option using stri_trim_left()
from stringi
and separate()
from tidyr
:
stri_trim_left()
removes strings with a leading whitespace, which occurs in your data starting at [100]
. You can then separate()
the strings into the three specified columns state
, country
and MSA
, separated by one or more whitespaces (sep = " "
). The last part extra = "merge"
makes sure that the strings only get separated into as many columns as specified in into = c()
.
Data
df <- data.frame(string = c("Florida Volusia Deltona-Daytona Beach-Ormond Beach",
"Iowa Polk Des Moines",
"Mississippi Hinds Jackson",
"Georgia Richmond Augusta-Richmond",
"Ohio Mahoning Youngstown-Warren-Boardman",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham"))
Code
library(tidyr)
library(stringi)
df %>% mutate(string = stri_trim_left(string)) %>%
separate(string, into = c("state", "country", "MSA"), sep = " ", extra = "merge")
Output
state country MSA
1 Florida Volusia Deltona-Daytona Beach-Ormond Beach
2 Iowa Polk Des Moines
3 Mississippi Hinds Jackson
4 Georgia Richmond Augusta-Richmond
5 Ohio Mahoning Youngstown-Warren-Boardman
6 Pennsylvania Lackawanna Scranton--Wilkes-Barre
7 Pennsylvania Dauphin Harrisburg-Carlisle
8 Florida Brevard Palm Bay-Melbourne-Titusville
9 Utah Utah Provo-Orem
10 Tennessee Hamilton Chattanooga
11 North Carolina Durham Durham
Note however, that the last row doesn't separate correctly, due to the whitespace in North Carolina.
CodePudding user response:
There's probably something useful, in terms of further data analysis, about keeping North with Carolina and New with York & etc. And it's always nice to have a one liner, but sometimes a few lines get you where is best for moving forward. Consider playing around like this:
maddening_txt <- " North Carolina Durham Durham"
strsplit(maddening_txt, split = ' ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham" "" ""
[5] "" " Durham"
nchar(strsplit(maddening_txt, split = ' ')[[1]])
[1] 15 7 0 0 0 7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham" " Durham"
string_vec <- strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
> string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"
Then you can trim the annoying leading space. 4L spaces may not be optimal, but it prevents [8] Palm Bay-Melbourne from splitting were you to use 1L space, and you get up close and comfy with your data as you decide how to process all 3141 counties or counties MSA. So, a list, maddening_txt <- list(), a for loop, and see what you want to further process. A completely unacceptable answer, but it will get you to the guts of what simplifications are offering and provide more control going forward.