Create several columns from a complex column in R-CodePudding

Imagine dataset:

df1 <- tibble::tribble(~City,   ~Population,
"United Kingdom > Leeds",   1500000,
"Spain > Las Palmas de Gran Canaria",   200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal",    250000,
"United States > Minneapolis, MN",  700000,
"United States > Milwaukee, WI",    NA,
"United States > Milwaukee",    400000)

The same dataset for visual representation:

I would like to:

Split column City into three columns: City, Country, State (if available, NA otherwise)
Check that Milwaukee has data in state and population (the NA for Milwaukee should have a value of 400000 and then split [City-State-Country] :).

Could you, please, suggest the easiest method to do so :)

CodePudding user response：

Here's another solution with extract to do the extraction of Country, City, and State in a single go with State extracted by an optional capture group (the remainder of the task is done as by @Allen's code):

library(tidyr)
library(dplyr)
df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>] ) > ([^,] ),? ?([A-Z] )?"
        ) %>%
  # as by @Allen Cameron:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]), 
            Population = Population[!is.na(Population)])

CodePudding user response：

You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:

library(tidyverse)

df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ', ', into = c('City', 'State')) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]), 
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups:   Country [4]
#>   Country        City                       State Population
#>   <chr>          <chr>                      <chr>      <dbl>
#> 1 Canada         Montreal                   <NA>      250000
#> 2 Canada         Nanaimo                    BC        150000
#> 3 Spain          Las Palmas de Gran Canaria <NA>      200000
#> 4 United Kingdom Leeds                      <NA>     1500000
#> 5 United States  Milwaukee                  WI        400000
#> 6 United States  Minneapolis                MN        700000