I am trying to extract specific keywords from cricket commentary. Some of the keywords I am looking for are combinations of two or three words.
This is the list of keywords I am looking for in the commentary:
region <- c("third man", "deep fine leg", "long leg", "deep square leg", "Deep mid wicket",
"cow corner", "long on", "Deep extra cover", "Deep Cover", "Deep point",
"Deep backword point", "fly slip", "backword point", "point", "cover", "Extra covers",
"mid off", "mid on", "mid wicket", "square leg", "backword square leg", "fine leg",
"slips", "gully", "silly point", "silly mid off", "silly mid on", "short leg",
"leg gully", "leg slip")
*Pretorius to Umesh Yadav, 1 run, pitched up by Pretorius, touch slower as it has been driven along the ground to long-off
Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover
Pretorius to Umesh Yadav, 1 run, touch fuller on off, Umesh Yadav drills it to long-off for a single*
How do I match keywords in the commentary for a particular ball when a keyword is a combination of two or more words?
I am expecting to get back which keyword(s) from the above list matched the commentary.
I am using R version 4.2.1 and RStudio.
CodePudding user response:
It would be best to preprocess your sentences and keywords before doing the match (e.g. convert to lowercase, remove punctuation, etc.).
For example, your sentence
Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover
won't match anything from your region vector, because the corresponding value, "Deep extra cover", is not all lowercase, whereas the sentence contains "deep extra cover".
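As a minimal illustration of why the preprocessing matters, here is a sketch using a shortened version of that sentence. The variable names (`sens`, `keyword`, `raw_hit`, `lower_hit`, `cleaned`) are only for this example and are not part of the original code.

```r
library(stringr)

sens    <- "plays a fantastic lofted shot to clear the boundary at deep extra cover"
keyword <- "Deep extra cover"

# Case-sensitive literal search: the capital "D" in the keyword prevents a match
raw_hit <- str_detect(sens, fixed(keyword))
raw_hit    # FALSE

# Lowercasing both sides makes the comparison case-insensitive
lower_hit <- str_detect(tolower(sens), fixed(tolower(keyword)))
lower_hit  # TRUE

# Replacing punctuation with spaces turns "long-off" into "long off"
cleaned <- str_replace_all(tolower("driven along the ground to long-off"),
                           "[[:punct:]]", " ")
cleaned    # "driven along the ground to long off"
```

Note that the keyword list in the question has "long on" but no "long off", so even after cleaning, the "long-off" sentences only match if that keyword is added.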
I'm not sure about your desired output, but to return the matches for each sentence, I would do something like this using dplyr and stringr.
library(stringr)
library(dplyr)
sentence <- data.frame(sens = c("Pretorius to Umesh Yadav, 1 run, pitched up by Pretorius, touch slower as it has been driven along the ground to long-off",
"Pretorius to Chahar, SIX, that's a great shot. Pitched up by Pretorius outside off, a slower one and Chahar goes down on his knee and plays a fantastic lofted shot to clear the boundary at deep extra cover"))
region <- c("third man", "deep fine leg", "long leg", "deep square leg", "Deep mid wicket",
"cow corner", "long on", "Deep extra cover", "Deep Cover", "Deep point",
"Deep backword point", "fly slip", "backword point", "point", "cover", "Extra covers",
"mid off", "mid on", "mid wicket", "square leg", "backword square leg", "fine leg",
"slips", "gully", "silly point", "silly mid off", "silly mid on", "short leg",
"leg gully", "leg slip")
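# Build one lowercase alternation pattern from all keywords
pattern <- paste0(tolower(region), collapse = "|")
sentence %>%
  rowwise() %>%
  # Collect every keyword found in the lowercased sentence, separated by "|"
  mutate(match = paste0(str_extract_all(tolower(sens), pattern, simplify = TRUE), collapse = "|"))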
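The approach above could be made somewhat stricter along the following lines. This is only a sketch under a few assumptions: `normalise()` and `find_regions()` are hypothetical helper names, the keyword vector is an illustrative subset of the list in the question, and "long off" has been added to it (it is not in the original list) to show how hyphenated commentary such as "long-off" can be handled. Word boundaries (`\\b`) keep "cover" from matching inside a word like "recovers", and sorting the keywords longest first means that if two keywords start at the same position, the longer one is tried first.

```r
library(stringr)

# Hypothetical helper: lowercase, turn punctuation into spaces, tidy whitespace
normalise <- function(x) {
  x <- tolower(x)
  x <- str_replace_all(x, "[[:punct:]]", " ")
  str_squish(x)
}

# Illustrative subset of the keywords; "long off" is an assumed addition
region <- c("Deep extra cover", "cover", "point", "silly point",
            "long on", "long off")

keys <- normalise(region)
keys <- keys[order(nchar(keys), decreasing = TRUE)]  # longest keywords first

# Whole-word alternation pattern built from the cleaned keywords
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Return the matched keywords for each sentence, separated by "|"
# (an empty string when nothing matches)
find_regions <- function(sens) {
  hits <- str_extract_all(normalise(sens), pattern)
  vapply(hits, paste, character(1), collapse = "|")
}

find_regions("driven along the ground to long-off")    # "long off"
find_regions("clear the boundary at deep extra cover") # "deep extra cover"
find_regions("he recovers well")                       # ""
```

Because `find_regions()` is vectorised over sentences, it can be used directly inside `mutate()` without `rowwise()`, for example `mutate(match = find_regions(sens))`.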