Removing unique string from text based on criteria in R

Time:09-15

I am trying to extract a specific string (in this case the name of an NFL player) from a text column in my dataframe. However, the column I'm extracting from comes in several formats, so no single uniform pattern pulls out the player. Below are some examples of the different formats:

play_1 <- "Kareem Hunt Pass From Jacoby Brissett for 1 Yd C.York extra point is GOOD, Center-C.Hughlett, Holder-C.Bojorquez."
play_2 <- "(10:03) J.Allen pass short right to G.Davis for 26 yards, TOUCHDOWN.T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin."
play_3 <- "Michael Thomas 9 Yd pass from Jameis Winston (Two-Point Run Conversion Failed)"
play_4 <- "(12:58) K.Murray pass short left to M.Brown for 6 yards, TOUCHDOWN. TWO-POINT CONVERSION ATTEMPT. K.Murray pass to Z.Ertz is complete. ATTEMPT SUCCEEDS."
play_5 <- "DJ Chark Pass From Jared Goff for 22 Yds A.Seibert extra point is GOOD, Center-S.Daly, Holder-J.Fox."
play_6 <- "(3:24) P.Mahomes pass short middle to C.Edwards-Helaire for 3 yards, TOUCHDOWN.J.Reid extra point is GOOD, Center-J.Winchester, Holder-T.Townsend."
play_7 <- "Isaiah McKenzie Pass From Josh Allen for 7 Yds T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin."

In each case, I want to extract the player who scored the touchdown. For play_1 it's "Kareem Hunt", for play_2 it's "G.Davis", for play_3 it's "Michael Thomas", for play_4 it's "M.Brown", for play_5 it's "DJ Chark", for play_6 it's "C.Edwards-Helaire", and for play_7 it's "Isaiah McKenzie".

This is what I've tried so far (assume that these strings are in the play_desc column of my dataframe), but I keep running into problems because no single pattern or piece of surrounding text is shared by every format:

td_scorer = case_when((play_type == "Passing Touchdown" & grepl(" Yd ", play_desc)) ~ str_trim(str_extract(play_desc, "[^\\d]*")),
                                 (play_type == "Passing Touchdown" & !grepl(" Yd ", play_desc)) ~ gsub("(.+) Pass.*", "\\1", play_desc))

Does anyone have any suggestions on how to combat this?

Answer:

This solution might 'break' when you encounter edge-cases, but it works as expected for the 7 examples listed:

library(tidyverse)

play_1 <- "Kareem Hunt Pass From Jacoby Brissett for 1 Yd C.York extra point is GOOD, Center-C.Hughlett, Holder-C.Bojorquez."
play_2 <- "(10:03) J.Allen pass short right to G.Davis for 26 yards, TOUCHDOWN.T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin."
play_3 <- "Michael Thomas 9 Yd pass from Jameis Winston (Two-Point Run Conversion Failed)"
play_4 <- "(12:58) K.Murray pass short left to M.Brown for 6 yards, TOUCHDOWN. TWO-POINT CONVERSION ATTEMPT. K.Murray pass to Z.Ertz is complete. ATTEMPT SUCCEEDS."
play_5 <- "DJ Chark Pass From Jared Goff for 22 Yds A.Seibert extra point is GOOD, Center-S.Daly, Holder-J.Fox."
play_6 <- "(3:24) P.Mahomes pass short middle to C.Edwards-Helaire for 3 yards, TOUCHDOWN.J.Reid extra point is GOOD, Center-J.Winchester, Holder-T.Townsend."
play_7 <- "Isaiah McKenzie Pass From Josh Allen for 7 Yds T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin."

df <- data.frame(play_desc = c(play_1, play_2, play_3, play_4, play_5, play_6, play_7))

df %>%
  mutate(name_of_player_that_scored = str_extract_all(play_desc, "(?<=to).*[A-Z][A-z-]*[ \\.]*[A-Z][A-z-]*(?=.*TOUCHDOWN)|[A-Z][A-z-]*[ \\.]*[A-Z][A-z-]*(?=.*[Pp]ass [Ff]rom)"))
#>                                                                                                                                                  play_desc
#> 1                                        Kareem Hunt Pass From Jacoby Brissett for 1 Yd C.York extra point is GOOD, Center-C.Hughlett, Holder-C.Bojorquez.
#> 2                      (10:03) J.Allen pass short right to G.Davis for 26 yards, TOUCHDOWN.T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin.
#> 3                                                                           Michael Thomas 9 Yd pass from Jameis Winston (Two-Point Run Conversion Failed)
#> 4 (12:58) K.Murray pass short left to M.Brown for 6 yards, TOUCHDOWN. TWO-POINT CONVERSION ATTEMPT. K.Murray pass to Z.Ertz is complete. ATTEMPT SUCCEEDS.
#> 5                                                     DJ Chark Pass From Jared Goff for 22 Yds A.Seibert extra point is GOOD, Center-S.Daly, Holder-J.Fox.
#> 6       (3:24) P.Mahomes pass short middle to C.Edwards-Helaire for 3 yards, TOUCHDOWN.J.Reid extra point is GOOD, Center-J.Winchester, Holder-T.Townsend.
#> 7                                           Isaiah McKenzie Pass From Josh Allen for 7 Yds T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin.
#>   name_of_player_that_scored
#> 1                Kareem Hunt
#> 2                    G.Davis
#> 3             Michael Thomas
#> 4                    M.Brown
#> 5                   DJ Chark
#> 6          C.Edwards-Helaire
#> 7            Isaiah McKenzie

Created on 2022-09-15 by the reprex package (v2.0.1)

Regex:

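(?<=to): look-behind for the letters "to" immediately before the match. There is no word boundary, so "to" inside a longer word (e.g. "Winston") also qualifies

.*[A-Z][A-z-]*: any run of characters, then a word that begins with a single capital letter, followed by any number of letters (either case) or "-" characters. In the examples the .* absorbs the space after "to", so those matches start with a space (str_trim() removes it). Note that the range A-z also covers the six ASCII characters between "Z" and "a" ([, \, ], ^, _ and `); [A-Za-z-] restricts the class to letters and "-"

[ \\.]*: then any number of spaces and/or full stops (e.g. "M.Brown" and "M. Brown")

[A-Z][A-z-]*: then another capital letter followed by any number of letters (either case) or "-" characters

(?=.*TOUCHDOWN): then look ahead for the word "TOUCHDOWN" somewhere later in the string

|: or

[A-Z][A-z-]*[ \\.]*[A-Z][A-z-]*: match a name as described above

(?=.*[Pp]ass [Ff]rom): look ahead for the words "pass from", with either word capitalised or not.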
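
One detail of the pattern above is easy to check in isolation: the range A-z spans more than letters. The sketch below is an illustration rather than part of the original answer. It uses base R with perl = TRUE (an assumption made here so the example needs no packages) and a made-up token "T_x" that does not come from the play data.

```r
# The six ASCII characters that sit between "Z" (code 90) and "a" (code 97).
between_Z_and_a <- c("[", "\\", "]", "^", "_", "`")

# perl = TRUE makes the range follow code-point order.
wide   <- grepl("^[A-z]$",    between_Z_and_a, perl = TRUE)
strict <- grepl("^[A-Za-z]$", between_Z_and_a, perl = TRUE)

# One row per character: accepted by the wide class, rejected by the strict one.
print(data.frame(char = between_Z_and_a, wide = wide, strict = strict))

# Made-up token: the wide class accepts it as a "name part", the strict one does not.
wide_token   <- grepl("^[A-Z][A-z-]*$",    "T_x", perl = TRUE)
strict_token <- grepl("^[A-Z][A-Za-z-]*$", "T_x", perl = TRUE)
print(c(wide = wide_token, strict = strict_token))
```

For the seven example plays the difference does not change the result, but spelling the class as [A-Za-z-] avoids surprises if a description ever contains one of those characters next to a name.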

Thanks for updating your example; if you have more example sentences, i.e. more 'edge-cases', you can edit your question again and I'll take another look at it.
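
When more edge cases turn up, a small check against the known scorers makes it easy to see whether a revised pattern still handles the old formats. The following is a minimal sketch under stated assumptions, not the answer's own code: extract_td_scorer is a hypothetical helper, it uses base R with perl = TRUE instead of stringr, it looks behind for " to " (with spaces) instead of using .*, it anchors the "Pass From" format to the start of the string, and it spells the letter class as [A-Za-z-]. It returns a plain character vector with NA where neither format matches.

```r
# Hypothetical helper (not from the original answer): first scorer match
# per string, or NA when neither format matches.
extract_td_scorer <- function(play_desc) {
  # A name: capitalised part, optional spaces/full stops, second capitalised part.
  name <- "[A-Z][A-Za-z-]*[ .]*[A-Z][A-Za-z-]*"
  pattern <- paste0(
    "(?<= to )", name, "(?=.*TOUCHDOWN)",   # "... to G.Davis for 26 yards, TOUCHDOWN"
    "|",
    "^", name, "(?=.*[Pp]ass [Ff]rom)"      # "Kareem Hunt Pass From ..."
  )
  hit <- regexpr(pattern, play_desc, perl = TRUE)
  found <- as.vector(hit) != -1L
  out <- rep(NA_character_, length(play_desc))
  out[found] <- regmatches(play_desc, hit)  # regmatches() drops non-matches
  out
}

# The seven examples from the question.
plays <- c(
  "Kareem Hunt Pass From Jacoby Brissett for 1 Yd C.York extra point is GOOD, Center-C.Hughlett, Holder-C.Bojorquez.",
  "(10:03) J.Allen pass short right to G.Davis for 26 yards, TOUCHDOWN.T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin.",
  "Michael Thomas 9 Yd pass from Jameis Winston (Two-Point Run Conversion Failed)",
  "(12:58) K.Murray pass short left to M.Brown for 6 yards, TOUCHDOWN. TWO-POINT CONVERSION ATTEMPT. K.Murray pass to Z.Ertz is complete. ATTEMPT SUCCEEDS.",
  "DJ Chark Pass From Jared Goff for 22 Yds A.Seibert extra point is GOOD, Center-S.Daly, Holder-J.Fox.",
  "(3:24) P.Mahomes pass short middle to C.Edwards-Helaire for 3 yards, TOUCHDOWN.J.Reid extra point is GOOD, Center-J.Winchester, Holder-T.Townsend.",
  "Isaiah McKenzie Pass From Josh Allen for 7 Yds T.Bass extra point is GOOD, Center-R.Ferguson, Holder-S.Martin."
)

# Scorers as stated in the question.
expected <- c("Kareem Hunt", "G.Davis", "Michael Thomas", "M.Brown",
              "DJ Chark", "C.Edwards-Helaire", "Isaiah McKenzie")

got <- extract_td_scorer(plays)
print(data.frame(expected = expected, got = got, ok = expected == got))

# Made-up non-scoring description: neither format applies, so the result is NA.
no_match <- extract_td_scorer("No scoring play here")
print(no_match)
```

New edge cases can be appended to plays and expected and the check rerun. Names with more than two parts would still need a wider name pattern. With stringr, str_extract() on the same pattern should behave the same way, although that equivalence is an expectation here and not something this sketch tests.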
