stringr: remove the n-th occurrence of a character

I know this question has been asked and answered before (e.g. Replace Nth occurrence of a character in a string with something else or Extract string between nth occurrence of character and another character), but for whatever reason I can't get the respective regex solutions to work.

I want to remove the n-th occurrence of a certain phrase from a character string. In the example below I want to remove the second occurrence of "ab", but none of my attempts work.

I have a workaround that uses str_locate_all and then str_sub on the positions of the phrase, but I'm hoping for a straightforward regex solution with str_remove, or with str_replace if I want to replace the second occurrence with something else.

text <- "abcdabef"

expected output:

"abcdef"

Non-working solutions that I tried (among many others):

library(stringr)
str_remove_all(text, "(?:ab){2}")
str_remove_all(text, "(?:ab){1}.*(ab)")

CodePudding user response：

To remove the second occurrence only, you need to use

sub("(ab.*?)ab", "\\1", "abcdabef")

To remove the nth occurrence, put a limiting quantifier after the group and set its value to n-1, so that the group consumes the first n-1 occurrences and the ab that follows is the nth one:

n <- 2
sub(paste0("((?:ab.*?){",n-1,"})ab"), "\\1", "abcdabef", perl=TRUE)
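The same call can be wrapped in a small helper. This is only a sketch: the name remove_nth and its replacement argument are made up for illustration, and it assumes that phrase contains no regex metacharacters and that replacement contains no backslashes.

```r
# Hypothetical helper (not part of base R or stringr): removes or replaces
# the n-th occurrence of `phrase` in `x`.
# Assumes `phrase` has no regex metacharacters and `replacement` has no backslashes.
remove_nth <- function(x, phrase, n, replacement = "") {
  stopifnot(n >= 1)
  # Group 1 captures the first n-1 occurrences plus the text that follows each
  pattern <- paste0("((?:", phrase, ".*?){", n - 1, "})", phrase)
  # Put group 1 back, followed by the replacement for the n-th occurrence
  sub(pattern, paste0("\\1", replacement), x, perl = TRUE)
}

remove_nth("abcdabef", "ab", 2)        # second "ab" removed
remove_nth("abcdabef", "ab", 2, "XY")  # second "ab" replaced by "XY"
remove_nth("abcdabef", "ab", 3)        # no third "ab": input comes back unchanged
```

When the string has fewer than n occurrences, the pattern does not match and sub returns the input as is.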

Note:

You need to use sub and not gsub since you only need a single replacement to be done.
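To make the difference concrete, here is a short comparison on a longer input that I made up for illustration (it is not from the question):

```r
x <- "abcdabefabghab"  # four occurrences of "ab"

# sub() stops after the first match, so only the second "ab" is dropped
once <- sub("(ab.*?)ab", "\\1", x, perl = TRUE)

# gsub() resumes after each match and also drops the fourth "ab"
every <- gsub("(ab.*?)ab", "\\1", x, perl = TRUE)

print(once)   # "abcdefabghab"
print(every)  # "abcdefabgh"
```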

Pattern details (when n=3):

((?:ab.*?){2}) - Group 1 (\1): two occurrences of ab, each followed by zero or more chars other than line break chars, as few as possible. The line break restriction applies because perl=TRUE is used here; if you need multiple line matching support, add (?s) at the start or replace .*? with (?s:.*?)
ab - the literal text ab, i.e. the occurrence that gets removed (the third one here)
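A quick way to check this breakdown is to look at what the pattern captures, and to compare the behaviour with and without (?s) on a multi-line string. The sample inputs are invented for illustration:

```r
pattern <- "((?:ab.*?){2})ab"  # the n = 3 pattern

# regexec() returns the full match first, then group 1
m <- regmatches("ab1ab2ab3ab4",
                regexec(pattern, "ab1ab2ab3ab4", perl = TRUE))[[1]]
print(m)  # "ab1ab2ab" "ab1ab2"

# With perl = TRUE the dot does not cross "\n", so nothing matches here
multi <- "ab\nab\nab"
unchanged <- sub(pattern, "\\1", multi, perl = TRUE)

# (?s) lets the dot match line breaks, so the third "ab" is removed
dotall <- sub(paste0("(?s)", pattern), "\\1", multi, perl = TRUE)
```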

If you have arbitrary strings with special chars in them, you need to escape them:

regex.escape <- function(string) {
  # Prefix every regex metacharacter with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

word <- "a (b)"
word <- regex.escape(word)
text <- "a (b)1___a (b)2___a (b)3___a (b)4"
n <- 3 # Let's remove the 3rd occurrence of a (b)
sub(paste0("((?:", word, ".*?){",n-1,"})", word), "\\1", text, perl=TRUE)
## => [1] "a (b)1___a (b)2___3___a (b)4"
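Since the question asks for stringr, the same pattern should also work with str_replace, which accepts \\1 in the replacement as well. A minimal sketch, assuming the phrase needs no escaping (the "XY" replacement is just an illustrative value):

```r
library(stringr)

text <- "abcdabef"
n <- 2

# Same pattern as in the sub() call; group 1 holds everything before the n-th "ab"
pattern <- paste0("((?:ab.*?){", n - 1, "})ab")

removed  <- str_replace(text, pattern, "\\1")    # drop the n-th "ab"
replaced <- str_replace(text, pattern, "\\1XY")  # or swap it for "XY"

print(removed)
print(replaced)
```

str_replace only touches the first match, so it plays the role of sub here; str_replace_all would behave like gsub.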

CodePudding user response：

Another possible solution, which locates the matches with stringr instead of building a regex replacement (it also works for any n):

library(tidyverse)

text <-  "abcdabef"

n <- 2

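# Locate every "ab"; [[1]] picks the start/end matrix for the single input string
str_locate_all(text, "ab") %>% .[[1]] %>% 
  # If an n-th match exists, blank out its span; otherwise return text unchanged
  when(n <= nrow(.) ~ `str_sub<-`(text, .[n, 1], .[n, 2], value = ""), ~ text)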

#> [1] "abcdef"
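As far as I know, when() is marked as deprecated in recent purrr releases, so the same logic can also be written as a plain function that needs only stringr. This is a sketch: the name remove_nth_fixed is made up, and fixed() is my addition so that the phrase is matched literally rather than as a regex.

```r
library(stringr)

# Hypothetical helper: drops the n-th literal occurrence of `phrase` from `x`
remove_nth_fixed <- function(x, phrase, n) {
  # Matrix with one row per match and the columns "start" and "end"
  loc <- str_locate_all(x, fixed(phrase))[[1]]
  # Fewer than n matches: nothing to remove
  if (n > nrow(loc)) return(x)
  # Overwrite the span of the n-th match with an empty string
  str_sub(x, loc[n, "start"], loc[n, "end"]) <- ""
  x
}

remove_nth_fixed("abcdabef", "ab", 2)
#> [1] "abcdef"
```

Because the phrase is matched with fixed(), inputs such as "a (b)" should not need escaping with this approach.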