I'm grappling with a regex solution to the following problem: say I have a series of strings that all contain a number of occurrences of the keyword Appendix or appendix, like this:
text <- c("Appendix abc Appendix def appendix final",
"blah blah Appendix abc Appendix finalissimo")
and I want to delete everything that follows the last occurrence of "Appendix" (in either case), including the keyword itself, to obtain the following desired output:
1 Appendix abc Appendix def
2 blah blah Appendix abc
I know that tidyverse solutions are possible (e.g., Extract all text before the last occurrence of a specific word), but here I'm specifically interested in a regex solution. I've tried a number of regex solutions, but none seems to work. The one I thought most promising involves a negative lookahead and a backreference, but it too does not produce the desired result:
library(stringr)
str_extract(text, "(?i).*(?!(appendix).*\\1)")
I'd be grateful for advice on why this solution does not work, and for a regex solution that does.
CodePudding user response:
I would use a regex with lookahead logic here:
text <- c("Appendix abc Appendix def appendix final",
"blah blah Appendix abc Appendix finalissimo")
output <- sub("(?i)\\s+appendix(?!.*\\bappendix\\b).*", "", text, perl=TRUE)
output
[1] "Appendix abc Appendix def" "blah blah Appendix abc"
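As for why the attempt in the question returns the whole string: the greedy .* first runs to the end of the input, and at that position the negative lookahead succeeds trivially because no "appendix" can follow, so nothing is cut off. Below is a minimal sketch of the difference. It uses base R with perl = TRUE rather than stringr, an assumption made only to keep the example free of dependencies, and the names attempt, last_pos and result are purely illustrative.

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# Pattern from the question: the greedy .* consumes the entire string,
# and at the very end the negative lookahead cannot fail.
attempt <- regmatches(text, regexpr("(?i).*(?!(appendix).*\\1)", text, perl = TRUE))
identical(attempt, text)  # TRUE: nothing was removed

# Putting the lookahead right after the keyword finds the last occurrence:
# an "appendix" that is not followed by another "appendix" later on.
last_pos <- as.integer(regexpr("(?i)\\bappendix\\b(?!.*\\bappendix\\b)", text, perl = TRUE))
last_pos  # 27 24

# Keep everything before that position and drop the trailing space.
result <- trimws(substr(text, 1, last_pos - 1))
result
```

The sub() call does the same thing in one step: it matches from the last keyword to the end of the string and replaces that span with an empty string.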
CodePudding user response:
You can use sub. The first .* is greedy and will take everything up to the last match of Appendix.*.
sub("(.*)Appendix.*", "\\1", text, ignore.case = TRUE)
#[1] "Appendix abc Appendix def " "blah blah Appendix abc "
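The result above keeps the space that preceded the last keyword. If that trailing space is unwanted, one possible variant (a sketch, not part of the original answer; the name trimmed is illustrative) moves the whitespace out of the capture group:

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# \\s+ sits outside the capture group, so the whitespace in front of the
# last "Appendix" is discarded together with the keyword and what follows.
trimmed <- sub("(.*)\\s+Appendix.*", "\\1", text, ignore.case = TRUE)
trimmed
```

Note that this variant requires whitespace before the keyword, so a string whose only "Appendix" is its very first word would be left unchanged. Wrapping the original call in trimws() avoids that restriction.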