Home > Software design >  Splitting state-county-MSA string variable
Splitting state-county-MSA string variable

Time:03-22

I have a variable below that I believe is separated by space.

 [95] "Florida          Volusia        Deltona-Daytona Beach-Ormond Beach"            
 [96] "Iowa             Polk           Des Moines"                                    
 [97] "Mississippi      Hinds          Jackson"                                       
 [98] "Georgia          Richmond       Augusta-Richmond"                              
 [99] "Ohio             Mahoning       Youngstown-Warren-Boardman"                    
[100] " Pennsylvania       Lackawanna             Scranton--Wilkes-Barre"             
[101] " Pennsylvania       Dauphin                Harrisburg-Carlisle"                
[102] " Florida            Brevard                Palm Bay-Melbourne-Titusville"      
[103] " Utah               Utah                   Provo-Orem"                         
[104] " Tennessee          Hamilton               Chattanooga"                        
[105] " North Carolina     Durham                 Durham"

I want to create three variables out of this string - state, county, and MSA. But the usual string split commands are not working. I tried stringi command too but fail to split the variable. Not sure why this is happening, as the command works on simpler strings.

> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,  : 
  non-character argument

> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
 list()

> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()

> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()

CodePudding user response:

Here's an option using stri_trim_left() from stringi and separate() from tidyr:

stri_trim_left() removes strings with a leading whitespace, which occurs in your data starting at [100]. You can then separate() the strings into the three specified columns state, country and MSA, separated by one or more whitespaces (sep = " "). The last part extra = "merge" makes sure that the strings only get separated into as many columns as specified in into = c().

Data

df <- data.frame(string = c("Florida          Volusia        Deltona-Daytona Beach-Ormond Beach",
                         "Iowa             Polk           Des Moines",
                         "Mississippi      Hinds          Jackson",
                         "Georgia          Richmond       Augusta-Richmond",
                         "Ohio             Mahoning       Youngstown-Warren-Boardman",
                         " Pennsylvania       Lackawanna             Scranton--Wilkes-Barre",
                         " Pennsylvania       Dauphin                Harrisburg-Carlisle",
                         " Florida            Brevard                Palm Bay-Melbourne-Titusville",
                         " Utah               Utah                   Provo-Orem",
                         " Tennessee          Hamilton               Chattanooga",
                         " North Carolina     Durham                 Durham"))

Code

library(tidyr)
library(stringi)

df %>% mutate(string = stri_trim_left(string)) %>%
      separate(string, into = c("state", "country", "MSA"), sep = "  ", extra = "merge")

Output

          state    country                                MSA
1       Florida    Volusia Deltona-Daytona Beach-Ormond Beach
2          Iowa       Polk                         Des Moines
3   Mississippi      Hinds                            Jackson
4       Georgia   Richmond                   Augusta-Richmond
5          Ohio   Mahoning         Youngstown-Warren-Boardman
6  Pennsylvania Lackawanna             Scranton--Wilkes-Barre
7  Pennsylvania    Dauphin                Harrisburg-Carlisle
8       Florida    Brevard      Palm Bay-Melbourne-Titusville
9          Utah       Utah                         Provo-Orem
10    Tennessee   Hamilton                        Chattanooga
11        North   Carolina      Durham                 Durham

Note however, that the last row doesn't separate correctly, due to the whitespace in North Carolina.

CodePudding user response:

There's probably something useful, in terms of further data analysis, about keeping North with Carolina and New with York & etc. And it's always nice to have a one liner, but sometimes a few lines get you where is best for moving forward. Consider playing around like this:

maddening_txt <-  " North Carolina     Durham                 Durham"
strsplit(maddening_txt, split = '    ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham"         ""                ""               
[5] ""                " Durham"
nchar(strsplit(maddening_txt, split = '    ')[[1]])
[1] 15  7  0  0  0  7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham"         " Durham"
 string_vec <- strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
> string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"

Then you can trim the annoying leading space. 4L spaces may not be optimal, but it prevents [8] Palm Bay-Melbourne from splitting were you to use 1L space, and you get up close and comfy with your data as you decide how to process all 3141 counties or counties MSA. So, a list, maddening_txt <- list(), a for loop, and see what you want to further process. A completely unacceptable answer, but it will get you to the guts of what simplifications are offering and provide more control going forward.

  • Related