Home > Net >  Splitting character with different string lengths while keeping the positional meaning
Splitting character with different string lengths while keeping the positional meaning

Time:10-20

I have a character column that encodes a location. The location could be of the form "city, state, country" or "state, country", or "country". I'd like to split these up into three columns but to ensure that if there are less than three elements each of the values goes into the correct state or country column.

Here is my data and what I've tried:

tib <- tribble(~obs, ~location,
1, "Miami, Florida, United States",
2, "Astrakhan Oblast, Russia",
3, "Mozambique")

separate(tib, location, c("city", "state", "country"), ", ")

Result:

# A tibble: 3 × 4
    obs city             state   country      
  <dbl> <chr>            <chr>   <chr>        
1     1 Miami            Florida United States
2     2 Astrakhan Oblast Russia  NA           
3     3 Mozambique       NA      NA       

In some sense, I'd like to run the separate function in reverse order so that the result looks as follows:

# A tibble: 3 × 4
    obs city  state            country      
  <dbl> <chr> <chr>            <chr>        
1     1 Miami Florida          United States
2     2 NA    Astrakhan Oblast Russia
3     3 NA    NA               Mozambique

Update:

This is a viable option but I was hoping for something simpler:

tib %>% mutate(country = str_extract(location, "[A-Za-z ] $"), 
state = str_extract(location, "(?<=\\,)[A-Za-z ] (?=\\,)"), 
city = str_extract(location, "^[A-Za-z ] (?=\\,)"))

CodePudding user response:

For your specific example you can use the fill argument in separate() to change to filling in missing values from the left instead of the right.

tidyr::separate(tib, location, c("city", "state", "country"), ", ", fill = "left")
# A tibble: 3 x 4
    obs city  state            country      
  <dbl> <chr> <chr>            <chr>        
1     1 Miami Florida          United States
2     2 NA    Astrakhan Oblast Russia       
3     3 NA    NA               Mozambique   

CodePudding user response:

Here is an alternative way using separate_rows: This was created with the help of dear @akrun Using complete to fill groups with NA to have same length as the maximum group. First try was to use complete:

library(dplyr)
library(tidyr)
  tib %>%    
    separate_rows("location", sep = ", ") %>% 
    group_by(obs) %>% 
    mutate(new = rev(c("country", "state", "city")[row_number()])) %>% 
    ungroup %>% 
    pivot_wider(names_from = new, values_from = location)

output:

    obs city  state            country      
  <dbl> <chr> <chr>            <chr>        
1     1 Miami Florida          United States
2     2 NA    Astrakhan Oblast Russia       
3     3 NA    NA               Mozambique   
  • Related