Separate a column with numeric and character into two columns using separate()-CodePudding

I have the following dataframe Teams :

Team               Year
012 Hortney        2017
012 Hortney        2018
013 James          2017
013 James          2018
014 Ilωus hero     2017
014 Ilωus hero     2018
015 Hortna         2017
015 Hortna         2018
016 Exclus race    2017
#with 25 000 more rows

And would like to transform it into the below df:

code    name         Year
012   Hortney        2017
012   Hortney        2018
013   James          2017
013   James          2018
014   Ilωus hero     2017
014   Ilωus hero     2018
015   Hortna         2017
015   Hortna         2018
016   Exclus race    2017
#with 25 000 more rows

I've tried this code separate(Team, c("code", "name")) but it makes the names of data strange (especially the Greek letter (ω), where everything after it disappears, and I must have ω intact for later coding. The last part of the name disappears as well in Exclus. Like this: (within brackets what I'm looking for)

code   name          Year
012   Hortney        2017
012   Hortney        2018
013   James          2018
014   Il             2017   (Ilωus hero)
014   Il             2018   (Ilωus hero)
015   Hortna         2017
015   Hortna         2018
016   Exclus         2017   (Exclus race)
#With 25 00 more rows

Anyone have any ideas?

CodePudding user response：

Try this

library(dplyr)

Teams |> 
      mutate(code = gsub("\\D" , "" , Team) ,
      name = trimws(gsub("\\d" , "" , Team))) |>
      select(code , name , Year)

output

  code        name Year
1  012     Hortney 2017
2  012     Hortney 2018
3  013       James 2017
4  013       James 2018
5  014  Ilωus hero 2017
6  014  Ilωus hero 2018
7  015      Hortna 2017
8  015      Hortna 2018
9  016 Exclus race 2017

CodePudding user response：

With some regex and the stringr package:

require(stringr)

Teams = data.frame(Team = c(

"012 Hortney", " 013  James ", " 018 Alain Philippe have a very long name"),  

Year = c( 2017,  2018, 2017) ) # The data for reproducible example

dplyr::mutate(Teams, 

           Team = str_squish(Team), # Supress the unwanted space in variable Team

           code = str_extract(Team, "[0-9]*"), 
# Extract the first successive digits in the variable Team

           name = str_extract(Team, "[:alpha:] [ ?[:alpha:]]*") ) %>%
# Extract the first successive letters of the variable Team, possibly with a space between the letters.

dplyr::select(code, name, Year)

Output

 code                                 name Year
1  012                              Hortney 2017
2  013                                James 2018
3  018 Alain Philippe have a very long name 2017

Please note that the variable 'code' will be exclusively composed of the first digits in the variable'Team', with no "." or "-" between the numbers. For example a number like '01.18' before the team name will result in a 'code' variable with the value '01': the regex will be blocked by the character ('.') in the team number.