Regex to edit text depending on number of occurrence of key word

Time:05-23

I'm grappling with a regex solution to the following problem: say, I have a series of strings that all contain a number of occurrences of the keyword Appendix or appendix like this:

text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

and I want to delete everything from the last occurrence of "Appendix" onwards (matched case-insensitively, and including the keyword itself) to obtain the following desired output:

1 Appendix abc Appendix def
2 blah blah Appendix abc 

I know tidyverse solutions are possible (e.g., Extract all text before the last occurrence of a specific word), but here I'm specifically interested in a regex solution. I've tried a number of regexes, but none seems to work. The one I thought most promising uses a negative lookahead and a backreference, but it does not produce the desired result either:

library(stringr)
str_extract(text, "(?i).*(?!(appendix).*\\1)")

I'd be grateful for advice why this solution does not work and for a regex solution that does work.

CodePudding user response:

I would use a regex with lookahead logic here:

text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")
output <- sub("(?i)\\s*appendix(?!.*\\bappendix\\b).*", "", text, perl=TRUE)
output

[1] "Appendix abc Appendix def" "blah blah Appendix abc"
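As for why the attempt in the question returns the whole string, here is a minimal sketch in base R. It uses perl = TRUE in place of stringr's regex engine, which is an assumption on my part, although I would expect both to behave the same way here. The greedy .* runs to the end of the string first, and at that position the negative lookahead succeeds trivially because no "appendix" can follow, so nothing is ever cut off.

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# The greedy .* consumes the entire string. At the end of the string,
# "(appendix).*\\1" cannot match, so the negative lookahead succeeds there
# and the engine never needs to backtrack.
attempt <- regmatches(text,
                      regexpr("(?i).*(?!(appendix).*\\1)", text, perl = TRUE))

# The match is therefore the unchanged input.
identical(attempt, text)
```

Putting the lookahead after the keyword, as in the sub() call above, is what ties the "no further appendix" condition to a specific occurrence.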

CodePudding user response:

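You can use sub. The first .* is greedy, so it consumes everything up to the last occurrence of Appendix; the rest of the pattern matches the keyword and whatever follows it, and the whole match is replaced by the captured group. The fourth argument is ignore.case, so appendix is matched as well.

sub("(.*)Appendix.*", "\\1", text, ignore.case = TRUE)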
#[1] "Appendix abc Appendix def " "blah blah Appendix abc "   
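
Note that this result keeps the space that preceded the keyword. If that is unwanted, one option (a sketch of mine, not part of the answer above) is to wrap the call in trimws, which strips the trailing whitespace:

```r
text <- c("Appendix abc Appendix def appendix final",
          "blah blah Appendix abc Appendix finalissimo")

# Same greedy capture as before, then drop the whitespace left at the right edge.
trimmed <- trimws(sub("(.*)appendix.*", "\\1", text, ignore.case = TRUE),
                  which = "right")
trimmed
```

Using which = "right" leaves any leading whitespace in the input untouched.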