regex: get text between pattern nearest to the left of another pattern

Time:12-31

I have a string txt that contains the pattern John and several country names. I also have vec_regex, a vector of regexes that match countries (but not all mentioned in the text).

What I would like to get is the text from the matched country closest to the left of John up to and including John: France text John.

I assume a negative lookahead is needed, but I couldn't get it working. Many thanks!

library(stringr)
txt <- "Germany Russia and Germany Russia text Germany text France text John text text France and Spain"

vec_regex <- c("German\\w*", "France|French", "Spain|Spanish", "Russia\\w*")
vec_regex_or <- paste(vec_regex, collapse="|")
vec_regex_or
#> [1] "German\\w*|France|French|Spain|Spanish|Russia\\w*"

pattern_left <- paste0("(",vec_regex_or, ")",".*John")
pattern_left
#> [1] "(German\\w*|France|French|Spain|Spanish|Russia\\w*).*John"
str_extract(txt, regex(pattern_left))
#> [1] "Germany Russia and Germany Russia text Germany text France text John"

pattern_left <- paste0("(",vec_regex_or, ")","(?!(",vec_regex_or,"))",".*John") #neg. lookahead
pattern_left
#> [1] "(German\\w*|France|French|Spain|Spanish|Russia\\w*)(?!(German\\w*|France|French|Spain|Spanish|Russia\\w*)).*John"
str_extract(txt, regex(pattern_left))
#> [1] "Germany Russia and Germany Russia text Germany text France text John"

Created on 2021-12-30 by the reprex package (v2.0.1)

CodePudding user response:

You need to use

pattern_left <- paste0("(",vec_regex_or, ")","(?:(?!",vec_regex_or,").)*","John")
pattern_left
# => [1] "(German\\w*|France|French|Spain|Spanish|Russia\\w*)(?:(?!German\\w*|France|French|Spain|Spanish|Russia\\w*).)*John"
str_extract(txt, regex(pattern_left))
# => [1] "France text John"

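The "(?:(?!",vec_regex_or,").)*" part builds a tempered greedy token: it consumes one character at a time, and only at positions where none of the country patterns starts, so the match cannot run across another country on its way to John.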
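To see why the single lookahead from the question is not enough, here is a minimal base R sketch that runs both patterns side by side. It assumes PCRE via perl = TRUE instead of stringr (which uses ICU); for these patterns the two engines should behave the same. The helper first_match is hypothetical and only there for illustration.

```r
# Base R sketch; perl = TRUE enables PCRE, which supports lookaheads.
txt <- "Germany Russia and Germany Russia text Germany text France text John text text France and Spain"
vec_regex <- c("German\\w*", "France|French", "Spain|Spanish", "Russia\\w*")
vec_regex_or <- paste(vec_regex, collapse = "|")

# Hypothetical helper: return the first match of `pattern` in `x`, or NA.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(x, m)
}

# Lookahead tested once, directly after the country; `.*` is then unconstrained.
single_check <- paste0("(", vec_regex_or, ")(?!(", vec_regex_or, ")).*John")

# Lookahead tested before every consumed character (tempered greedy token).
tempered <- paste0("(", vec_regex_or, ")(?:(?!", vec_regex_or, ").)*John")

res_single <- first_match(single_check, txt)
res_tempered <- first_match(tempered, txt)

print(res_single)
#> [1] "Germany Russia and Germany Russia text Germany text France text John"
print(res_tempered)
#> [1] "France text John"
```

In the first pattern the lookahead only inspects the position right after "Germany" (a space), so it passes and .* happily runs over every later country. In the second, the same check is repeated at each step, which forces the match to start at the last country before John.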

Also, if you plan to match these strings as whole words, consider adding word boundaries:

pattern_left <- paste0("\\b(",vec_regex_or, ")\\b","(?:(?!",vec_regex_or,").)*","John\\b")
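As a rough illustration of what the word boundaries change, the sketch below compares both variants on a made-up string (txt2 is an assumption, not from the question) in which John also appears as the start of a longer word. It uses base R with perl = TRUE rather than stringr.

```r
# Same alternation as in the question, written out so the block is self-contained.
vec_regex_or <- "German\\w*|France|French|Spain|Spanish|Russia\\w*"

# Made-up input: "Johnny" contains "John" but is a different word.
txt2 <- "France text Johnny and Spain text John"

plain   <- paste0("(", vec_regex_or, ")(?:(?!", vec_regex_or, ").)*John")
bounded <- paste0("\\b(", vec_regex_or, ")\\b(?:(?!", vec_regex_or, ").)*John\\b")

res_plain   <- regmatches(txt2, regexpr(plain, txt2, perl = TRUE))
res_bounded <- regmatches(txt2, regexpr(bounded, txt2, perl = TRUE))

print(res_plain)
#> [1] "France text John"
print(res_bounded)
#> [1] "Spain text John"
```

Without boundaries the pattern stops inside "Johnny"; with John\\b that partial hit is rejected and the match moves on to the standalone John.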