How can I extract a pattern plus the n words that follow it from a string in R using stringr::str_extract?

I have a list of texts, and within these texts I want to retrieve every occurrence of a pattern together with the n words (in this case, 4) that come after it. Here is my example:

library(stringr)

all_terms <- c("hospital santa clara bla bla bla bla bla hospital san francisco",
               " blablabla ",
               "hospital holy mary, bla bla bla hospital 9 de julho")

# "hospital", any single character, then one word
all_terms %>% 
  str_extract_all("hospital.\\w+") %>%
  unlist()

[1] "hospital santa" "hospital san"   "hospital holy"  "hospital 9"

What I wanted:

[1] "hospital santa clara bla" "hospital san francisco"   "hospital holy mary"  "hospital 9 de julho"

Answer:

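Try this. The group (\\s\\w+) matches one whitespace character followed by a word, and {1,3} repeats it one to three times, so each match is "hospital" plus up to three following words:

str_extract_all(all_terms, "hospital(\\s\\w+){1,3}")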

[[1]]
[1] "hospital santa clara bla" "hospital san francisco"  

[[2]]
character(0)

[[3]]
[1] "hospital holy mary"  "hospital 9 de julho"
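
If the number of following words needs to vary, the same pattern can be built from a parameter. This is a minimal sketch and not part of the answer above: extract_with_following is a hypothetical helper name, and it assumes keyword contains no regex metacharacters, since it is pasted into the pattern as-is.

```r
library(stringr)

# Hypothetical helper: returns every match of `keyword` followed by
# one to `n` words, where a word is whitespace plus \w characters.
extract_with_following <- function(x, keyword, n = 3) {
  # Assumes `keyword` is a plain string with no regex metacharacters.
  pattern <- paste0(keyword, "(\\s\\w+){1,", n, "}")
  unlist(str_extract_all(x, pattern))
}

all_terms <- c("hospital santa clara bla bla bla bla bla hospital san francisco",
               " blablabla ",
               "hospital holy mary, bla bla bla hospital 9 de julho")

result <- extract_with_following(all_terms, "hospital", n = 3)
print(result)
```

With n = 3 this should give the four strings asked for in the question. Because \w does not match punctuation, the match on "hospital holy mary," stops before the comma.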

Answer:

library(stringr)
all_terms <- c("hospital santa clara bla bla bla bla bla hospital san francisco",
               " blablabla ",
               "hospital holy mary, bla bla bla hospital 9 de julho")

all_terms %>% 
  str_extract_all("hospital\\s\\S*\\s\\S*\\s*\\S*\\s*\\S*") %>%
  unlist() %>% 
  str_replace_all(pattern=",.*", replacement = "")
#> [1] "hospital santa clara bla bla" "hospital san francisco"      
#> [3] "hospital holy mary"           "hospital 9 de julho"

Created on 2022-04-19 by the reprex package (v2.0.1)
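
The str_replace_all() step is needed in the code above because \S also matches the comma in "mary,". As a possible variation, not taken from the original answer, the comma can be excluded inside the pattern with a negated character class, and the repetition capped at three words so the result lines up with the output asked for in the question. Treating only commas as word-ending punctuation is an assumption based on the sample data.

```r
library(stringr)

all_terms <- c("hospital santa clara bla bla bla bla bla hospital san francisco",
               " blablabla ",
               "hospital holy mary, bla bla bla hospital 9 de julho")

# [^\\s,]+ is a run of characters that are neither whitespace nor a comma,
# so a match ends at the comma and no post-processing step is required.
no_comma <- unlist(str_extract_all(all_terms, "hospital(\\s+[^\\s,]+){1,3}"))
print(no_comma)
```

Other punctuation (periods, semicolons) would still be captured by this class; add those characters inside the brackets if the real texts contain them.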