Break Apart a String into Separate Columns R-CodePudding

I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own column. I have tried separating with a space delimiter, but that has not worked well since there are teams such as "Duke" with 1 part to their name, and teams with 2 to 3 parts to their name (Michigan State, South Dakota State, etc). There also teams with "-" dashes in their name.

Here is my data:

df <- data.frame(list(
  game_info = c(
    "12/16 7:00 PM 751 Appalachian State 752 Duke",
    "12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
    "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts", 
    "12/16 10:00 PM 757 Dartmouth 758 Stanford"
    )
  ))

Desired output:

date  time     away_team_id  away_team_name     home_team_id home_team_name
12/16 7:00 PM    751         Appalachian State  752          Duke
12/16 7:00 PM    753         Chicago State      754          Indiana-Purdue
12/16 8:00 PM    755         Texas-Arlington    756          Oral Roberts
12/16 10:00 PM   757         Dartmouth          758          Stanford

@Jonny Phelps @doRemy

CodePudding user response：

A simple way is to use the extract from the dplyr library with a regex expression:

# Define the column names:
column_names <- c("date", "time", "away_team_id", "away_team_name", "home_team_id", "home_team_name")
# Define the regex expression:
regex_expr <- paste(
  "([0-9]{1,2}[/][0-9]{1,2})", # The date
  "([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})", # The time
  "([0-9] )", # The away team id
  "([A-Za-z -] )", # The away team name
  "([0-9] )", # The home team id
  "([A-Za-z -] )" # The home team name
)
# Extract the columns:
df %>% extract(col = game_info, into = column_names, regex = regex_expr)

CodePudding user response：

Here's one with regex. See regex101 link for the regex explanations

regex <- "^(\\d{2}\\/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d )\\s*([^\\d.] )(\\d )\\s*([^\\d.] )$"

data <- data.frame(game_info=
  "12/16 7:00 PM 751 Appalachian State 752 Duke"
  ,"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue"
  ,"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts"
  ,"12/16 10:00 AM 757 Dartmouth 758 Stanford"
)
library(stringr)

out <- do.call(rbind, str_match_all(data, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
                "home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out

Explanation:

^(\d{2}/\d{2}) - starts with 2 digits/2 digits like 12/16. ^ is a start anchor and () are used to say we want to capture this group for plucking out

\s* - 0 or more spaces between our first group and the next

(\d{1,2}:\d{2}\s*(PM|AM)) - want 1 or 2 digits : 2 digits, then possibly a space and PM or AM

\s*(\d )\s* - spaces around any number of digits, the first id

([^\d.] ) - all non numeric characters. This will fall down if there are ever numbers in your team names. If so, find some examples and we can improve it. White space is captured afterwards so is removed later with trimws

(\d )\s* - second id and spaces

([^\d.] )$ - finally the other team name and the end sentence anchor

CodePudding user response：

Here is one alternative approach:

library(dplyr)
library(stringr)
library(tidyr)

my_pattern <- "\\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm]))"

df %>% 
  mutate(date = substr(game_info, 1,5),
         time = str_extract(game_info, my_pattern),
         helper = str_remove(game_info, my_pattern), .keep="unused") %>% 
  mutate(helper = str_squish(str_remove(helper, substr(helper, 1,5)))) %>% 
  separate(helper, c("away_team_id", "away_team_name"), sep = '\\s', remove = FALSE) %>%   
  mutate(home_team_id = str_extract_all(helper, '(\\d )(?!.*\\d)'),
         home_team_name = sub(".*\\s", "", helper), .keep="unused")

   date     time away_team_id  away_team_name home_team_id home_team_name
1 12/16  7:00 PM          751     Appalachian          752           Duke
2 12/16  7:00 PM          753         Chicago          754 Indiana-Purdue
3 12/16  8:00 PM          755 Texas-Arlington          756        Roberts
4 12/16 10:00 PM          757       Dartmouth          758       Stanford

CodePudding user response：

You can try this solution requiring only simple pattern matching with [:digit:]. The one additional requirement is simply having date and time at the beginning and the character team info in between the number IDs.
Additionally you can use trimws on the split list dspl to remove unwanted TAB or similar.

Data

dat <- structure(list(game_info = c("12/16 7:00 PM 751 Appalachian State 752 Duke", 
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue", "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts", 
"12/16 10:00 PM 757 Dartmouth 758 Stanford")), class = "data.frame", row.names = c(NA, 
-4L))

dspl <- strsplit( dat$game_info, "  " )

dat_tmp <- cbind( date=as.vector(sapply( dspl, function(x) x[1] )), 
  time=unlist( lapply( dspl, function(x) paste( x[2:3], collapse=" " ) ) ),
  away_team_id=as.vector( sapply( dspl, function(x) x[4] ) ) )

data.frame( dat_tmp, 
  away_team_name=sapply( dspl, function(x) 
    paste(x[ tail( head( grep( "[[:digit:]]", x )[3]:grep( "[[:digit:]]", x )[4], -1 ), -1 ) ], collapse=" ") ), 
  home_team_id=sapply( dspl, function(x) 
    x[ max( grep( "[[:digit:]]", x ) )] ), 
  home_team_name=sapply( dspl, function(x) 
    paste( tail( x[ max( grep( "[[:digit:]]", x ) ):length(x)], -1), collapse=" " ) ) )

   date     time away_team_id    away_team_name home_team_id home_team_name
1 12/16  7:00 PM          751 Appalachian State          752           Duke
2 12/16  7:00 PM          753     Chicago State          754 Indiana-Purdue
3 12/16  8:00 PM          755   Texas-Arlington          756   Oral Roberts
4 12/16 10:00 PM          757         Dartmouth          758       Stanford