How to extract phrases with word limit after specific Word?

Time:12-26

I have the following text, and I want to extract 5 words after a specific word from a string vector:

my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remains fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."

my_teams <- tolower(c("Brazil", "Argentina"))

I want to extract the next 5 words after the word Brazil or Argentina and then combine them as an entire string.

I have used the following script to get the exact word, but not the phrases after a specific word:

library(stringr)

pattern <- paste(my_teams, collapse = "|")

v <- unlist(str_extract_all(tolower(my_text), pattern))

paste(v, collapse=' ')

Any suggestions would be appreciated. Thanks!

CodePudding user response:

Maybe not the best possible, but:

Split into a vector of words, remove non-word characters, lowercase (to match targets):

words <- strsplit(my_text,'\\s', perl= TRUE)[[1]] |>
    gsub(pattern = "\\W", replacement = "", perl = TRUE) |>
    tolower()

Find locations of targets, get strings, paste back together:

loc <- which(words %in% my_teams)
sapply(loc, \(i) words[(i + 1):(i + 5)], simplify = FALSE) |>
    sapply(paste, collapse=" ")
[1] "have failed to dislodge brazil"    "from the top of the"              
[3] "won the final within 90"           "in the last eight tournaments"    
[5] "the 1998 finalists getting beyond"

You may need one more paste(., collapse = " ") at the end to combine these into a single string.
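The final paste can be combined with a guard for a target that sits fewer than five words from the end of the text, where words[(i + 1):(i + 5)] would yield NA entries. This is a minimal sketch, not the answerer's code: next_words is a hypothetical helper name, and the short sample sentence is made up for illustration.

```r
# Hypothetical helper: return up to n words following each target word.
next_words <- function(text, targets, n = 5) {
  # Split on whitespace, strip non-word characters, lowercase (as above)
  words <- strsplit(text, "\\s+", perl = TRUE)[[1]]
  words <- tolower(gsub("\\W", "", words, perl = TRUE))
  words <- words[nzchar(words)]          # drop tokens that became empty
  loc <- which(words %in% targets)
  vapply(loc, function(i) {
    idx <- i + seq_len(n)
    idx <- idx[idx <= length(words)]     # do not index past the last word
    paste(words[idx], collapse = " ")
  }, character(1))
}

# Made-up sample text, only for illustration
phrases <- next_words("Brazil beat Argentina in the final",
                      c("brazil", "argentina"))
phrases
# One combined string, as asked in the question
combined <- paste(phrases, collapse = " ")
combined
```

Because the guard truncates instead of padding, a phrase near the end of the text simply comes back shorter than n words.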

CodePudding user response:

You can use

library(stringr)
my_text <- "The World Cup 2022 winners, Argentina, have failed to dislodge Brazil from the top of the Fifa men’s world rankings as England remain fifth in the post-Qatar standings.
Had Argentina won the final within 90 minutes, they would have taken the top spot from Brazil. In the last eight tournaments going back to USA 94, no team leading the rankings at kick-off has won the tournament, with only Brazil, the 1998 finalists, getting beyond the quarter-finals."
my_teams <- tolower(c("Brazil", "Argentina"))
pattern <- paste0("(?i)\\b(?:", paste(my_teams, collapse = "|"), ")\\s+(\\S+(?:\\s+\\S+){4})")
res <- lapply(str_match_all(my_text, pattern), function (m) m[,2])
v <- unlist(res)
paste(v, collapse=' ')
# => [1] "from the top of the won the final within 90"

Note the use of str_match_all, which keeps the captured group texts.

Details:

  • (?i) - turns on case-insensitive matching
  • \b - a word boundary
  • (?:Brazil|Argentina) - one of the countries
  • \s+ - one or more whitespaces
  • (\S+(?:\s+\S+){4}) - Group 1: one or more non-whitespaces and then four repetitions of one or more whitespaces followed by one or more non-whitespaces.
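Because the pattern requires whitespace directly after the country name, occurrences such as "Argentina," or "Brazil." are skipped, which is why the output above contains only two phrases. If those should count as well, one possible variant (an assumption about the desired behaviour, not part of the answer above) allows any non-word characters after the name and puts the capture inside a lookahead, so a country name that falls within another match's five words is still found. The sample sentence is made up for illustration.

```r
library(stringr)

teams <- c("brazil", "argentina")
# \W+ tolerates punctuation after the team name; the capture group sits in
# a lookahead so the following words are not consumed by the match.
# Note: \w+ treats apostrophes and hyphens as word separators.
pattern <- paste0("(?i)\\b(?:", paste(teams, collapse = "|"),
                  ")\\b(?=\\W+(\\w+(?:\\W+\\w+){0,4}))")

# Made-up sample text, only for illustration
text <- "Argentina, have failed to dislodge Brazil from the top of the rankings."
m <- str_match_all(text, pattern)[[1]]
phrases <- m[, 2]   # column 2 holds capture group 1
phrases
paste(phrases, collapse = " ")
```

The {0,4} quantifier lets a phrase be shorter than five words when the text ends early; use {4} to require exactly five words as in the original pattern.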

CodePudding user response:

Here is an alternative approach:

  1. Transform the vector into a tibble.
  2. Use separate_rows to get one word per row.
  3. Create a helper column x with the lower-case word.
  4. Make groups that each start at brazil or argentina.
  5. Remove the rows with group == 0 (words before the first team).
  6. Keep words 2 to 6 of each group.
  7. Finally, summarise:

my_teams <- tolower(c("Brazil", "Argentina"))

library(dplyr)
library(tidyr)

tibble(my_text = my_text) %>% 
  separate_rows(my_text, sep = " ") %>% 
  mutate(x = tolower(my_text)) %>%
  group_by(group = cumsum(grepl(paste(my_teams, collapse = "|"), x))) %>% 
  filter(group > 0) %>% 
  slice(2:6) %>% 
  summarise(x = paste(my_text, collapse = " "))
  group x
  <int> <chr>                             
1     1 have failed to dislodge           
2     2 from the top of the               
3     3 won the final within 90           
4     4 In the last eight tournaments     
5     5 the 1998 finalists, getting beyond
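The grouping step can be easier to follow on a small vector in base R. This is only an illustrative sketch with a made-up word vector: cumsum over the logical "is this a team?" flag gives every word the index of the most recent team name, and words before the first team land in group 0.

```r
# Made-up sample words, only for illustration
words <- c("The", "winners,", "Argentina,", "have", "failed",
           "Brazil", "from", "the", "top")

# TRUE where a word contains a team name (substring match, as with grepl above)
is_team <- grepl("brazil|argentina", tolower(words))

# Running count of team names seen so far: 0 before the first one
group <- cumsum(is_team)
group

# Drop group 0, then take up to five words after each group's first word
keep <- group > 0
following <- lapply(split(words[keep], group[keep]),
                    function(g) head(g[-1], 5))
phrases <- vapply(following, paste, character(1), collapse = " ")
unname(phrases)
```

A group ends as soon as the next team name appears, which is why the first row of the output above has only four words: "Brazil" follows "Argentina," within five words and starts group 2.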