R: extract value and insert in 3 existing columns-CodePudding

I have a large dataset like the following and I'm trying to add value to 3 columns based on column Country.

Country<-c("Asia","Africa - Benin (Cotonou)",
           "Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
           , "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")

ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)


 > mydata
>   ID                          Country         col1     col2     col3 
> 1  1                             Asia
> 2  2         Africa - Benin (Cotonou)
> 3  3          Europe - France (Paris)
> 4  4           Asia - China(Shanghai)
> 5  5 Europe - United Kingdom (London)
> 6  6        Europe - France (Orléans)
> 7  7            Afrique - Togo (Lomé)
> 8  8        Afrique - Sénégal (Dakar)
> 9  9 Asia - Pakistan (Rahim Yar Khan)

I tried the following but im having issue with the regular expression

library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")

The result that I get is the following:

ID     Country                          Col1           Col2     Col3
 1    Asia                              Asia           <NA>     <NA>
 2    Africa - Benin (Cotonou)          Africa         Benin  Cotonou
 3    Europe - France (Paris)           Europe         France  Paris
 4    Asia - China(Shanghai)            Asia           China   Shanghai
 5    Europe - United Kingdom (London)  Europe         United  Kingdom
 6    Europe - France (Orléans)         Europe         France  Orl
 7    Afrique - Togo (Lomé)             Afrique        Togo      L
 8     Afrique - Sénégal (Dakar)        Afrique        S         n
 9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan   Rahim

Some part are missing in column 3, row 5,6,7,8 and 9.

the result that i want is the following:

ID     Country                          Col1           Col2                Col3
     1    Asia                              Asia           <NA>            <NA>
     2    Africa - Benin (Cotonou)          Africa         Benin            Cotonou
     3    Europe - France (Paris)           Europe         France           Paris
     4    Asia - China(Shanghai)            Asia           China            Shanghai
     5    Europe - United Kingdom (London)  Europe         United Kingdom    London
     6    Europe - France (Orléans)         Europe         France            Orléans
     7    Afrique - Togo (Lomé)             Afrique        Togo              Lomé
     8     Afrique - Sénégal (Dakar)        Afrique        Sénégal           Dakar
     9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan          Rahim Yar Khan

Any suggestion on how to do this?

CodePudding user response：

this is my first contribution so please forgive me if I am wrong. I did it this way, may not be the easiest way but I guess it worked:

mydata %>% 
  separate(col = "Country",
           sep = "[\\(-]",
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>% 
  mutate(Col3 = str_remove(Col3, "\\)"))

CodePudding user response：

Update: to remove the extra spaces we could add this line at the end of the code: mutate(across(starts_with("col"), str_squish))

We could replace the first separator - by ( then we get one separtor . Afterwords do separate and finally remove the remaining )

library(dplyr)
library(stringr)
library(tidyr)

  ID    col1           col2           col3
1  1    Asia           <NA>           <NA>
2  2  Africa          Benin        Cotonou
3  3  Europe         France          Paris
4  4    Asia          China       Shanghai
5  5  Europe United Kingdom         London
6  6  Europe         France        Orléans
7  7 Afrique           Togo           Lomé
8  8 Afrique        Sénégal          Dakar
9  9    Asia       Pakistan Rahim Yar Khan

CodePudding user response：

tidyr::separate will separate text into columns based on delimiter (which is by default any non alpha-numeric) so it separates on spaces by default. You can use the extra argument to merge all the remaining text into the 3rd column like so:

mydata %>% 
    separate(Country, 
            into = c("Col1", "Col2", "Col3"),
            extra = "merge")
  ID    Col1     Col2             Col3
1  1    Asia     <NA>             <NA>
2  2  Africa    Benin         Cotonou)
3  3  Europe   France           Paris)
4  4    Asia    China        Shanghai)
5  5  Europe   United Kingdom (London)
6  6  Europe   France         Orléans)
7  7 Afrique     Togo            Lomé)
8  8 Afrique  Sénégal           Dakar)
9  9    Asia Pakistan  Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].

However, with this we get an unnecessary ) at the end. You can either remove this via a mutate or instead of separate use tidyr::extract that allows extracting based on regex:

mydata %>% 
    extract(Country, 
            into = c("Col1", "Col2", "Col3"),
            regex = "([[:alnum:]] ) - ([[:alnum:]] ) ?\\((.*)\\)")
  ID    Col1     Col2           Col3
1  1    <NA>     <NA>           <NA>
2  2  Africa    Benin        Cotonou
3  3  Europe   France          Paris
4  4    Asia    China       Shanghai
5  5    <NA>     <NA>           <NA>
6  6  Europe   France        Orléans
7  7 Afrique     Togo           Lomé
8  8 Afrique  Sénégal          Dakar
9  9    Asia Pakistan Rahim Yar Khan

CodePudding user response：

library(dplyr)
library(tidyr)

mydata %>%
  separate(Country, into = c("col1", "col2", "col3"), '( - | ?\\()', remove = FALSE) %>%
  mutate(col3 = gsub(')', '', col3))

#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#>   ID                          Country    col1           col2           col3
#> 1  1                             Asia    Asia           <NA>           <NA>
#> 2  2         Africa - Benin (Cotonou)  Africa          Benin        Cotonou
#> 3  3          Europe - France (Paris)  Europe         France          Paris
#> 4  4           Asia - China(Shanghai)    Asia          China       Shanghai
#> 5  5 Europe - United Kingdom (London)  Europe United Kingdom         London
#> 6  6        Europe - France (Orléans)  Europe         France        Orléans
#> 7  7            Afrique - Togo (Lomé) Afrique           Togo           Lomé
#> 8  8        Afrique - Sénégal (Dakar) Afrique        Sénégal          Dakar
#> 9  9 Asia - Pakistan (Rahim Yar Khan)    Asia       Pakistan Rahim Yar Khan

CodePudding user response：

A data.table solution:

require(data.table)
setDT(mydata)

splitCountry <- function( c_str ) {
  
  vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
  col1 <- vec[1]
  col2 <- vec[2]
  col3 <- vec[3]
  
  return(list(col1,
              col2,
              col3))
  
}

mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]