I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own column. I have tried separating with a space delimiter, but that has not worked well since there are teams such as "Duke" with 1 part to their name, and teams with 2 to 3 parts to their name (Michigan State, South Dakota State, etc). There also teams with "-" dashes in their name.
Here is my data:
df <- data.frame(list(
game_info = c(
"12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue",
"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford"
)
))
Desired output:
date time away_team_id away_team_name home_team_id home_team_name
12/16 7:00 PM 751 Appalachian State 752 Duke
12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue
12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts
12/16 10:00 PM 757 Dartmouth 758 Stanford
@Jonny Phelps @doRemy
CodePudding user response:
A simple way is to use the extract
from the dplyr
library with a regex expression:
# Define the column names:
column_names <- c("date", "time", "away_team_id", "away_team_name", "home_team_id", "home_team_name")
# Define the regex expression:
regex_expr <- paste(
"([0-9]{1,2}[/][0-9]{1,2})", # The date
"([0-9]{1,2}:[0-9]{1,2} [A-Za-z]{2})", # The time
"([0-9] )", # The away team id
"([A-Za-z -] )", # The away team name
"([0-9] )", # The home team id
"([A-Za-z -] )" # The home team name
)
# Extract the columns:
df %>% extract(col = game_info, into = column_names, regex = regex_expr)
CodePudding user response:
Here's one with regex. See regex101 link for the regex explanations
regex <- "^(\\d{2}\\/\\d{2})\\s*(\\d{1,2}:\\d{2}\\s*(PM|AM))\\s*(\\d )\\s*([^\\d.] )(\\d )\\s*([^\\d.] )$"
data <- data.frame(game_info=
"12/16 7:00 PM 751 Appalachian State 752 Duke"
,"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue"
,"12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts"
,"12/16 10:00 AM 757 Dartmouth 758 Stanford"
)
library(stringr)
out <- do.call(rbind, str_match_all(data, regex))
out <- as.data.frame(out)
# remove full string & AM/PM
out$V1 <- NULL
out$V4 <- NULL
names(out) <- c("date", "time", "away_team_id", "away_team_name",
"home_team_id", "home_team_name")
# remove white space from end
out$away_team_name <- trimws(out$away_team_name)
out$home_team_name <- trimws(out$home_team_name)
out
Explanation:
^(\d{2}/\d{2}) - starts with 2 digits/2 digits like 12/16. ^ is a start anchor and () are used to say we want to capture this group for plucking out
\s* - 0 or more spaces between our first group and the next
(\d{1,2}:\d{2}\s*(PM|AM)) - want 1 or 2 digits : 2 digits, then possibly a space and PM or AM
\s*(\d )\s* - spaces around any number of digits, the first id
([^\d.] ) - all non numeric characters. This will fall down if there are ever numbers in your team names. If so, find some examples and we can improve it. White space is captured afterwards so is removed later with trimws
(\d )\s* - second id and spaces
([^\d.] )$ - finally the other team name and the end sentence anchor
CodePudding user response:
Here is one alternative approach:
library(dplyr)
library(stringr)
library(tidyr)
my_pattern <- "\\b((1[0-2]|0?[1-9]):([0-5][0-9]) ([AaPp][Mm]))"
df %>%
mutate(date = substr(game_info, 1,5),
time = str_extract(game_info, my_pattern),
helper = str_remove(game_info, my_pattern), .keep="unused") %>%
mutate(helper = str_squish(str_remove(helper, substr(helper, 1,5)))) %>%
separate(helper, c("away_team_id", "away_team_name"), sep = '\\s', remove = FALSE) %>%
mutate(home_team_id = str_extract_all(helper, '(\\d )(?!.*\\d)'),
home_team_name = sub(".*\\s", "", helper), .keep="unused")
date time away_team_id away_team_name home_team_id home_team_name
1 12/16 7:00 PM 751 Appalachian 752 Duke
2 12/16 7:00 PM 753 Chicago 754 Indiana-Purdue
3 12/16 8:00 PM 755 Texas-Arlington 756 Roberts
4 12/16 10:00 PM 757 Dartmouth 758 Stanford
CodePudding user response:
You can try this solution requiring only simple pattern matching with [:digit:]
. The one additional requirement is simply having date and time at the beginning and the character team info in between the number IDs.
Additionally you can use trimws
on the split list dspl
to remove unwanted TAB or similar.
Data
dat <- structure(list(game_info = c("12/16 7:00 PM 751 Appalachian State 752 Duke",
"12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue", "12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts",
"12/16 10:00 PM 757 Dartmouth 758 Stanford")), class = "data.frame", row.names = c(NA,
-4L))
dspl <- strsplit( dat$game_info, " " )
dat_tmp <- cbind( date=as.vector(sapply( dspl, function(x) x[1] )),
time=unlist( lapply( dspl, function(x) paste( x[2:3], collapse=" " ) ) ),
away_team_id=as.vector( sapply( dspl, function(x) x[4] ) ) )
data.frame( dat_tmp,
away_team_name=sapply( dspl, function(x)
paste(x[ tail( head( grep( "[[:digit:]]", x )[3]:grep( "[[:digit:]]", x )[4], -1 ), -1 ) ], collapse=" ") ),
home_team_id=sapply( dspl, function(x)
x[ max( grep( "[[:digit:]]", x ) )] ),
home_team_name=sapply( dspl, function(x)
paste( tail( x[ max( grep( "[[:digit:]]", x ) ):length(x)], -1), collapse=" " ) ) )
date time away_team_id away_team_name home_team_id home_team_name
1 12/16 7:00 PM 751 Appalachian State 752 Duke
2 12/16 7:00 PM 753 Chicago State 754 Indiana-Purdue
3 12/16 8:00 PM 755 Texas-Arlington 756 Oral Roberts
4 12/16 10:00 PM 757 Dartmouth 758 Stanford