Extract a string between two words, with multiple patterns

Time:12-20

I have a series of strings like "the appointment of XX as head", "appoints YY as head" (included in a data frame labelled "df" in a column labelled "title")

I want to extract the names (XX, YY) enclosed between the two different expressions.

I'm currently using the following:

df$name <- df$title %>% 
  str_extract(regex(pattern = "(?<=Appointment of).*(?= as)", ignore_case=TRUE))

However, that works with only one of the two possible patterns. I tried combining two patterns with |:

df$name <- df$title %>% 
  str_extract(regex(pattern = "(?<=Appointment of).*(?= as)"|"(?<=joins).*(?= as)", ignore_case=TRUE))

which of course does not work. How can I create multiple patterns to feed into str_extract?

Happy to provide further details if needed!

Thanks a lot

CodePudding user response:

You can use

df$name <- df$title %>% 
  str_extract(regex(pattern = "(?<=\\bAppointment of\\s|\\bjoins\\s).*?(?=\\s+as\\b)", ignore_case=TRUE))

Details:

  • (?<= - start of a positive lookbehind
    • \bAppointment of\s - a word boundary (\b), Appointment of, and then a whitespace char (\s)
  • | - or
    • \bjoins\s - a whole word joins and a whitespace
  • ) - end of the lookbehind
  • .*? - any zero or more chars other than line break chars, as few as possible
  • (?=\s+as\b) - a positive lookahead that requires one or more whitespaces, as and a word boundary immediately to the right of the current location.
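
As a quick check, here is a minimal sketch that applies the same idea to the two sample strings from the question. It assumes the second opening word is "appoints" (as in the sample strings) rather than "joins", and the small df built here is only a stand-in for the real data frame.

```r
library(stringr)

# Stand-in for the question's data frame (assumed sample titles)
df <- data.frame(title = c("the appointment of XX as head",
                           "appoints YY as head"))

# One lookbehind with two alternatives covers both openings
pattern <- regex("(?<=\\bappointment of\\s|\\bappoints\\s).*?(?=\\s+as\\b)",
                 ignore_case = TRUE)

df$name <- str_extract(df$title, pattern)
df$name
## [1] "XX" "YY"
```

Titles that contain neither opening give NA in df$name.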

Note that in stringr (which uses the ICU regex engine), lookbehind patterns need not be strictly fixed-width; they only have to be bounded in length, so you can use

"(?<=\\bAppointment of\\s{1,100}|\\bjoins\\s{1,100}).*?(?=\\s+as\\b)"

where \s{1,100} can match one to a hundred whitespace chars.
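
A small sketch of the bounded-width variant on titles with irregular spacing. The titles vector is made up for illustration. Because the match can begin right after the first whitespace char following the opening words, the result may keep extra leading whitespace, so this sketch trims it with str_trim.

```r
library(stringr)

# Hypothetical titles with several spaces after the opening words
titles <- c("Appointment of   XX as head", "joins  YY as head")

pattern <- regex(
  "(?<=\\bAppointment of\\s{1,100}|\\bjoins\\s{1,100}).*?(?=\\s+as\\b)",
  ignore_case = TRUE
)

# Extract, then drop any whitespace left at either end of the match
names_found <- str_trim(str_extract(titles, pattern))
names_found
```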

CodePudding user response:

strapply (from the gsubfn package) can do it without using zero width constructs. The function ~ ..2 picks the second capture group, so only that group is returned.

library(gsubfn)

x <- c("the appointment of XX as head", "appoints YY as head") # input
strapply(x, "(appointment of|appoints) (.*?) as head", ~ ..2, simplify = TRUE)
## [1] "XX" "YY"

or use (?:...) to specify that the first parenthesized portion is not to be a capture group:

strapply(x, "(?:appointment of|appoints) (.*?) as head", simplify = TRUE)
## [1] "XX" "YY"

Base R

In base R it could be done with sub if every component of x matches

sub(".*(appointment of|appoints) (.*?) as head.*", "\\2", x)
## [1] "XX" "YY"

or strcapture if not

proto <- data.frame(dummy = character(0), value = character(0))
strcapture("(appointment of|appoints) (.*?) as head", x, proto)[, 2]
## [1] "XX" "YY"
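
To illustrate why the two base R options differ, here is a short sketch with one element that does not match (x2 is a made-up input). sub leaves a non-matching element unchanged, while strcapture returns NA for it.

```r
# Hypothetical input: the second element matches neither opening
x2 <- c("the appointment of XX as head", "no match here")

# sub() returns non-matching elements as they are
sub_res <- sub(".*(appointment of|appoints) (.*?) as head.*", "\\2", x2)
sub_res
## [1] "XX"            "no match here"

# strcapture() fills non-matching rows with NA
proto <- data.frame(dummy = character(0), value = character(0),
                    stringsAsFactors = FALSE)
cap_res <- strcapture("(appointment of|appoints) (.*?) as head", x2, proto)[, 2]
cap_res
## [1] "XX" NA
```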