I have a large dataset like the following and I'm trying to add value to 3 columns based on column Country.
Country<-c("Asia","Africa - Benin (Cotonou)",
"Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
, "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)
> mydata
> ID Country col1 col2 col3
> 1 1 Asia
> 2 2 Africa - Benin (Cotonou)
> 3 3 Europe - France (Paris)
> 4 4 Asia - China(Shanghai)
> 5 5 Europe - United Kingdom (London)
> 6 6 Europe - France (Orléans)
> 7 7 Afrique - Togo (Lomé)
> 8 8 Afrique - Sénégal (Dakar)
> 9 9 Asia - Pakistan (Rahim Yar Khan)
I tried the following but im having issue with the regular expression
library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
The result that I get is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom
6 Europe - France (Orléans) Europe France Orl
7 Afrique - Togo (Lomé) Afrique Togo L
8 Afrique - Sénégal (Dakar) Afrique S n
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim
Some part are missing in column 3, row 5,6,7,8 and 9.
the result that i want is the following:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom London
6 Europe - France (Orléans) Europe France Orléans
7 Afrique - Togo (Lomé) Afrique Togo Lomé
8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
Any suggestion on how to do this?
CodePudding user response:
this is my first contribution so please forgive me if I am wrong. I did it this way, may not be the easiest way but I guess it worked:
mydata %>%
separate(col = "Country",
sep = "[\\(-]",
into = c("Col1", "Col2", "Col3"),
remove = FALSE,
fill = "right") %>%
mutate(Col3 = str_remove(Col3, "\\)"))
CodePudding user response:
Update: to remove the extra spaces we could add this line at the end of the code:
mutate(across(starts_with("col"), str_squish))
We could replace the first separator -
by (
then
we get one separtor .
Afterwords do separate
and finally remove the remaining )
library(dplyr)
library(stringr)
library(tidyr)
ID col1 col2 col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 Europe United Kingdom London
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
CodePudding user response:
tidyr::separate
will separate text into columns based on delimiter (which is by default any non alpha-numeric) so it separates on spaces by default. You can use the extra
argument to merge all the remaining text into the 3rd column like so:
mydata %>%
separate(Country,
into = c("Col1", "Col2", "Col3"),
extra = "merge")
ID Col1 Col2 Col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou)
3 3 Europe France Paris)
4 4 Asia China Shanghai)
5 5 Europe United Kingdom (London)
6 6 Europe France Orléans)
7 7 Afrique Togo Lomé)
8 8 Afrique Sénégal Dakar)
9 9 Asia Pakistan Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
However, with this we get an unnecessary ) at the end. You can either remove this via a mutate or instead of separate
use tidyr::extract
that allows extracting based on regex:
mydata %>%
extract(Country,
into = c("Col1", "Col2", "Col3"),
regex = "([[:alnum:]] ) - ([[:alnum:]] ) ?\\((.*)\\)")
ID Col1 Col2 Col3
1 1 <NA> <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 <NA> <NA> <NA>
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
CodePudding user response:
library(dplyr)
library(tidyr)
mydata %>%
separate(Country, into = c("col1", "col2", "col3"), '( - | ?\\()', remove = FALSE) %>%
mutate(col3 = gsub(')', '', col3))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> ID Country col1 col2 col3
#> 1 1 Asia Asia <NA> <NA>
#> 2 2 Africa - Benin (Cotonou) Africa Benin Cotonou
#> 3 3 Europe - France (Paris) Europe France Paris
#> 4 4 Asia - China(Shanghai) Asia China Shanghai
#> 5 5 Europe - United Kingdom (London) Europe United Kingdom London
#> 6 6 Europe - France (Orléans) Europe France Orléans
#> 7 7 Afrique - Togo (Lomé) Afrique Togo Lomé
#> 8 8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
#> 9 9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
CodePudding user response:
A data.table
solution:
require(data.table)
setDT(mydata)
splitCountry <- function( c_str ) {
vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
col1 <- vec[1]
col2 <- vec[2]
col3 <- vec[3]
return(list(col1,
col2,
col3))
}
mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]