Create a regex to pull team score

Using the following dput in R, I tried creating a regex to pull the second team's score:

structure(list(year = c(1984, 1984, 1984), name = c("g1", "g2", 
"g3"), data = c("team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\nat Charlotte", 
" NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte", 
" NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
), team_1_seed = c("1", "5", "4"), team_1 = c("UNC\nteam\nteam", 
"Auburn", "Indiana\nteam\nteam"), team_2 = c("St. John's (NY)", 
"Richmond", "Northeastern"), team_2_seed = c("9", "12", "11"), 
team_1_score = c("65", "71", "70"), team_2_score = c("65", 
NA, "70")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", 
"data.frame"))

There are three examples in the dput and I need it to pull "63", "72", and "69" respectively.

Here's what I got for the first team score that seemed to work:

  df %>% mutate(team_1_score = str_match(data, ".*?\\d+ \\D+ (\\d{2,3}).*")[,2])

I have not been able to come up with a pattern that works for the second team.

CodePudding user response：

library(stringr)
library(dplyr)

# Data with only the needed variable
x <- data.frame(
  data = c(
      "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte",
      " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte", 
      " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
      )
  )

# Function to get the score of the first or second team
# This only works if all the observations have the same structure as
# the example

get_team_score <- function(string, which = 1) {
  
  sapply(string, \(x) {
    
    scores <- str_split(x, '\n') %>%
      unlist() %>% 
      str_subset('^team .+ [0-9]{2}$')
    
    scores[which] %>%
      str_extract('[0-9]+$') %>%
      as.numeric()
    
  })
  
  
}

x %>% 
  mutate(
    score_team1 = get_team_score(data, 1),
    score_team2 = get_team_score(data, 2)
    ) 
#>                                                                                    data
#> 1            team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte
#> 2                               NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte
#> 3  NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford
#>   score_team1 score_team2
#> 1          65          63
#> 2          71          72
#> 3          70          69

Created on 2022-03-15 by the reprex package (v2.0.1)
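
The same line-based idea can also be sketched without looping over each string by hand. This is only an illustration, assuming every game string contains exactly two lines of the form `team <seed> <name> <score>`; `extract_scores` is a hypothetical helper name, not a stringr function.

```r
library(stringr)

x <- c(
  "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\natCharlotte",
  " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
  " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
)

# Hypothetical helper: returns a character matrix with one row per game
# and one column per scored line (assumes exactly two scored lines per game).
extract_scores <- function(string) {
  # (?m) lets ^ and $ match at line breaks; the capture group holds the
  # trailing run of digits on each "team <seed> <name> <score>" line.
  hits <- str_match_all(string, "(?m)^team .+ ([0-9]+)$")
  t(sapply(hits, function(m) m[, 2]))
}

scores <- extract_scores(x)
scores[, 1]  # first team's scores:  "65" "71" "70"
scores[, 2]  # second team's scores: "63" "72" "69"
```

Lines such as `team 1 UNC` or a bare `team` are skipped because they do not end in a space followed by digits. If some rows had fewer than two scored lines, `sapply()` would no longer return a regular matrix, so that case would need separate handling.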

CodePudding user response：

You can extract a number that is located after a space and before the last line feed character:

sub(".* (\\d+)\n.*", "\\1", df$data)
## => [1] "63" "72" "69"

where df is the structure you supplied as input. Details:

.* - any zero or more chars, as many as possible
" " - a space
(\d+) - Group 1 (\1): one or more digits
\n - a newline, line feed char
.* - the rest of the string.
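
To see why the greedy `.*` lands on the last score, here is a small sketch on the three sample strings. It uses `perl = TRUE` with the `(?s)` flag so that `.` explicitly matches newlines; that switch of regex engine, and the lazy `.*?` variant for the first team's score, are assumptions added for illustration and are not part of the pattern described above.

```r
x <- c(
  "team 1 UNC\nteam\nteam 8 Temple 65\nteam 9 St. John's (NY) 63\nat Charlotte",
  " NC\nteam 5 Auburn 71\nteam 12 Richmond 72\nat Charlotte",
  " NC\nteam 4 Indiana\nteam\nteam 6 VCU 70\nteam 11 Northeastern 69\nat East Rutherford"
)

# Greedy .* consumes as much as possible, then backtracks to the LAST
# "space + digits + newline", which is the second team's score.
second <- sub("(?s).* (\\d+)\n.*", "\\1", x, perl = TRUE)
second
#> [1] "63" "72" "69"

# Lazy .*? stops at the FIRST "space + digits + newline", which is the
# first team's score (illustrative variant, not from the original pattern).
first <- sub("(?s).*? (\\d+)\n.*", "\\1", x, perl = TRUE)
first
#> [1] "65" "71" "70"
```

Seeds such as the `8` in `team 8 Temple 65` are never captured, because the digits must be followed directly by a newline.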
