I have a series of strings like "the appointment of XX as head" and "appoints YY as head", stored in a column labelled "title" of a data frame labelled "df".
I want to extract the names (XX, YY) that sit between the two different surrounding expressions.
I'm currently using the following:
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=Appointment of).*(?= as)", ignore_case=TRUE))
However, that works with only one of the two possible patterns. I tried combining two patterns like this:
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=Appointment of).*(?= as)"|"(?<=joins).*(?= as)", ignore_case=TRUE))
which of course does not work. How can I create multiple patterns to feed into str_extract?
Happy to provide further details if needed!
Thanks a lot
CodePudding user response:
You can use
library(stringr)
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=\\bAppointment of\\s|\\bjoins\\s).*?(?=\\s+as\\b)", ignore_case=TRUE))
Details:
(?<= - start of a positive lookbehind
\bAppointment of\s - a word boundary (\b), Appointment of, and then a whitespace char (\s)
| - or
\bjoins\s - a whole word joins and a whitespace
) - end of the lookbehind
.*? - any zero or more chars other than line break chars, as few as possible
(?=\s+as\b) - a positive lookahead that requires one or more whitespaces, as, and a word boundary immediately to the right of the current location.
Note that in stringr, the lookbehind patterns do not have to be strictly fixed-width, so you can also use
"(?<=\\bAppointment of\\s{1,100}|\\bjoins\\s{1,100}).*?(?=\\s+as\\b)"
where \s{1,100} can match one to a hundred whitespace chars.
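As a quick check, the combined lookbehind can be tried on a small data frame. This is only a minimal sketch: the sample titles (including the "joins" one) are made up for illustration, and the lookahead is written as \\s+as\\b to match the "one or more whitespaces" description above.

```r
library(stringr)  # stringr also re-exports the %>% pipe

# Hypothetical sample data; the real df$title will differ
df <- data.frame(
  title = c("the appointment of XX as head", "Acme joins YY as head"),
  stringsAsFactors = FALSE
)

# Both left-hand contexts sit inside a single lookbehind, separated by |
pattern <- "(?<=\\bAppointment of\\s|\\bjoins\\s).*?(?=\\s+as\\b)"

df$name <- df$title %>%
  str_extract(regex(pattern, ignore_case = TRUE))

df$name
## [1] "XX" "YY"
```

Titles that match neither alternative come back as NA, so it may be worth checking sum(is.na(df$name)) on the real data.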
CodePudding user response:
strapply can do it without using zero-width constructs. Only the second capture group is returned.
library(gsubfn)
x <- c("the appointment of XX as head", "appoints YY as head") # input
strapply(x, "(appointment of|appoints) (.*?) as head", ~ ..2, simplify = TRUE)
## [1] "XX" "YY"
or use (?:...) to specify that the first parenthesized portion is not to be a capture group:
strapply(x, "(?:appointment of|appoints) (.*?) as head", simplify = TRUE)
## [1] "XX" "YY"
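For anyone who would rather stay within stringr, a similar capture-group approach should work with str_match, which returns the full match in column 1 and the capture groups in the following columns. This is a sketch on the same sample input, not part of the strapply solution above; the \\s+ and \\b pieces are my own additions.

```r
library(stringr)

x <- c("the appointment of XX as head", "appoints YY as head")  # input

# (?:...) is non-capturing, so the name is the first capture group (column 2)
m <- str_match(
  x,
  regex("(?:appointment of|appoints)\\s+(.*?)\\s+as\\b", ignore_case = TRUE)
)
m[, 2]
## [1] "XX" "YY"
```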
Base R
In base R it could be done with sub if every component of x matches
sub(".*(appointment of|appoints) (.*?) as head.*", "\\2", x)
## [1] "XX" "YY"
or with strcapture if some components may not match (those come back as NA)
proto <- data.frame(dummy = character(0), value = character(0))
strcapture("(appointment of|appoints) (.*?) as head", x, proto)[, 2]
## [1] "XX" "YY"
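The difference between the two base R options shows up once an input does not match. The sketch below adds a made-up third string ("no name here") to the sample input to illustrate this; the variable names are hypothetical.

```r
# Sample input plus one string that matches neither pattern (made up)
x <- c("the appointment of XX as head", "appoints YY as head", "no name here")
pat <- "(appointment of|appoints) (.*?) as head"

# sub() returns a non-matching string unchanged
via_sub <- sub(paste0(".*", pat, ".*"), "\\2", x)
via_sub      # "XX", "YY", then the untouched "no name here"

# strcapture() returns NA for a non-matching string
proto <- data.frame(dummy = character(0), value = character(0),
                    stringsAsFactors = FALSE)
via_capture <- strcapture(pat, x, proto)[, 2]
via_capture  # "XX", "YY", then NA
```

So sub is fine when every element is known to match, while strcapture makes the failures easy to spot with is.na().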