Using the following dput in R, I tried creating a regex to pull the second team's score:
structure(list(year = c(1984, 1984, 1984), name = c("g1", "g2",
"g3"), data = c("team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\nat Charlotte",
" NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
" NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
), team_1_seed = c("1", "5", "4"), team_1 = c("UNC\nteam\nteam",
"Auburn", "Indiana\nteam\nteam"), team_2 = c("St. John's (NY)",
"Richmond", "Northeastern"), team_2_seed = c("9", "12", "11"),
team_1_score = c("65", "71", "70"), team_2_score = c("65",
NA, "70")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
There are three examples in the dput and I need it to pull "63", "72", and "69" respectively.
Here's what I got for the first team's score, which seemed to work:
df %>% mutate(team_1_score = str_match(data, ".*?\\d+ \\D+ (\\d{2,3}).*")[,2])
I haven't been able to get a pattern working for the second team.
CodePudding user response:
library(stringr)
library(dplyr)
# Data with only the needed variable
x <- data.frame(
  data = c(
    "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte",
    " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
    " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
  )
)
# Function to get the score of the first or second team
# This only works if all the observations have the same structure as
# the example
get_team_score <- function(string, which = 1) {
  sapply(string, \(x) {
    # Split into lines and keep those shaped like "team <seed> <name> <score>"
    scores <- str_split(x, '\n') %>%
      unlist() %>%
      str_subset('^team .+ [0-9]{2}$')
    # Pick the requested line and pull the digits at its end
    scores[which] %>%
      str_extract('[0-9]+$') %>%
      as.numeric()
  })
}
x %>%
  mutate(
    score_team1 = get_team_score(data, 1),
    score_team2 = get_team_score(data, 2)
  )
#> data
#> 1 team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte
#> 2 NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte
#> 3 NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford
#> score_team1 score_team2
#> 1 65 63
#> 2 71 72
#> 3 70 69
Created on 2022-03-15 by the reprex package (v2.0.1)
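Since the function above relies on every observation having the same structure, it may be worth checking that assumption before trusting the result. The following is a minimal base R sketch, not part of the original answer: the names `data_strings` and `count_score_lines` are made up for illustration, and the fourth string is a hypothetical malformed row added only to show what a failing check looks like. It counts, per row, the lines shaped like "team <seed> <name> <two-digit score>".

```r
# First three strings are the answer's example data; the fourth is a
# hypothetical malformed row (one score line only), added for illustration.
data_strings <- c(
  "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte",
  " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
  " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford",
  " NC\nteam 2 Made-Up 80\nat Nowhere"
)

# Hypothetical helper: count the lines in each string that end in a
# two-digit score, i.e. the lines a line-based extractor would pick up.
count_score_lines <- function(x) {
  vapply(
    strsplit(x, "\n", fixed = TRUE),
    function(lines) sum(grepl("^team .+ [0-9]{2}$", lines)),
    integer(1)
  )
}

n_lines <- count_score_lines(data_strings)

# Rows that do not have exactly two score lines need a closer look
suspect_rows <- which(n_lines != 2L)
print(n_lines)
print(suspect_rows)
```

Rows flagged this way would yield `NA` for the missing team if passed to a line-based extractor, so filtering or inspecting them first is probably the safer route.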
CodePudding user response:
You can extract a number that is located after a space and before the last line feed character:
sub(".* (\\d+)\n.*", "\\1", df$data)
## => [1] "63" "72" "69"
where df is the structure you supplied as input. Details:
.* - any zero or more chars, as many as possible
(\d+) - Group 1 (\1): one or more digits
\n - a newline (line feed) char
.* - the rest of the string.
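To see the pattern at work without the full tibble, here is a minimal base R sketch. It assumes the three `data` strings from the question; the names `data_strings` and `second_score` are made up for illustration. Because the leading `.*` is greedy, the capture group lands on the last number that is followed by a newline, which is the second team's score in these examples.

```r
# The three `data` strings from the question (vector name is illustrative)
data_strings <- c(
  "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\nat Charlotte",
  " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
  " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
)

# Hypothetical helper: replace the whole string with the digits that sit
# between a space and the last newline.
second_score <- function(x) {
  sub(".* (\\d+)\n.*", "\\1", x)
}

team_2_score <- second_score(data_strings)
print(team_2_score)
#> [1] "63" "72" "69"
```

Note that `sub()` returns the input unchanged when the pattern does not match, so a row without a "space, digits, newline" sequence would come back as the full original string rather than `NA`.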