I know this question has been asked and answered before (e.g. "Replace Nth occurrence of a character in a string with something else" or "Extract string between nth occurrence of character and another character"), but for whatever reason I can't get the respective regex solutions to work.
I basically want to remove the n-th occurrence of a certain phrase from a character string. In the example below I want to remove the second occurrence of "ab", but none of my attempts work.
I have a workaround that uses str_locate_all and then str_sub based on the positions of the phrase, but I'm hoping for a straightforward regex solution with str_remove, or with str_replace if I want to replace this second occurrence with something.
text <- "abcdabef"
expected output:
"abcdef"
Non-working solutions that I tried (among many others):
library(stringr)
str_remove_all(text, "(?:ab){2}")
str_remove_all(text, "(?:ab){1}.*(ab)")
CodePudding user response:
To remove the second occurrence only, you need to use
sub("(ab.*?)ab", "\\1", "abcdabef")
To remove the nth occurrence, put a limiting quantifier after the group, with its single (min) value set to n-1:
n <- 2
sub(paste0("((?:ab.*?){",n-1,"})ab"), "\\1", "abcdabef", perl=TRUE)
Note: you need to use sub and not gsub, since only a single replacement should be made.
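To make the sub versus gsub point concrete, here is a small sketch on a longer input. The string with four occurrences is my own illustration, not from the question:

```r
# Illustrative input (assumption): four occurrences of "ab"
text4 <- "abcdabefabghab"

# sub() stops after the first match, so only the 2nd "ab" is removed
once <- sub("(ab.*?)ab", "\\1", text4, perl = TRUE)

# gsub() keeps matching after the first replacement,
# so the 4th "ab" is removed as well
every <- gsub("(ab.*?)ab", "\\1", text4, perl = TRUE)

print(once)   # "abcdefabghab"
print(every)  # "abcdefabgh"
```

With gsub the pattern is applied again to the rest of the string, so every second occurrence disappears, which is usually not what is wanted here.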
Pattern details (when n=3):
- ((?:ab.*?){2}) - Group 1 (\1): two occurrences of ab, each followed by any zero or more chars other than line break chars, as few as possible (since I am using perl=TRUE here; if you need multiple line matching support, add (?s) at the start or replace .*? with (?s:.*?))
- ab - an ab
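As a rough check of the pattern details, the sketch below builds the n=3 pattern and then shows the effect of (?s) on input with line breaks. Both test strings are made up for illustration:

```r
n <- 3
pattern <- paste0("((?:ab.*?){", n - 1, "})ab")
print(pattern)  # "((?:ab.*?){2})ab"

# Illustrative input (assumption): the third "ab" is dropped
third_removed <- sub(pattern, "\\1", "ab1ab2ab3ab4", perl = TRUE)
print(third_removed)  # "ab1ab23ab4"

# With line breaks between occurrences, "." does not cross "\n" in PCRE,
# so the plain pattern finds no match and the text is returned unchanged
multi <- "ab\nab\nab"
no_dotall <- sub("((?:ab.*?){1})ab", "\\1", multi, perl = TRUE)

# Adding (?s) lets "." match line breaks, so the 2nd "ab" is removed
with_dotall <- sub("(?s)((?:ab.*?){1})ab", "\\1", multi, perl = TRUE)
```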
If you have arbitrary strings with special chars in them, you need to escape them:
regex.escape <- function(string) {
  # Prefix each regex metacharacter (including +) with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
word <- "a (b)"
word <- regex.escape(word)
text <- "a (b)1___a (b)2___a (b)3___a (b)4"
n <- 3 # Let's remove the 3rd occurrence of a (b)
sub(paste0("((?:", word, ".*?){",n-1,"})", word), "\\1", text, perl=TRUE)
## => [1] "a (b)1___a (b)2___3___a (b)4"
See the regex demo.
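Since the question also asks about replacing the occurrence rather than removing it, the pieces above could be combined into one helper. This is only a sketch: the name replace_nth and its replacement argument are hypothetical, and it assumes the replacement text contains no backslashes (those would be read as backreferences by sub).

```r
# Escape regex metacharacters so the phrase is matched literally
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical helper: replace the n-th occurrence of a literal phrase.
# The default replacement = "" removes the occurrence.
# Assumption: `replacement` contains no backslashes.
replace_nth <- function(text, phrase, n, replacement = "") {
  stopifnot(n >= 1)
  word <- regex.escape(phrase)
  pattern <- paste0("((?:", word, ".*?){", n - 1, "})", word)
  # If there are fewer than n occurrences, sub() returns text unchanged
  sub(pattern, paste0("\\1", replacement), text, perl = TRUE)
}

print(replace_nth("abcdabef", "ab", 2))        # "abcdef"
print(replace_nth("abcdabef", "ab", 2, "XY"))  # "abcdXYef"
print(replace_nth("abcdabef", "ab", 3))        # "abcdabef" (no 3rd occurrence)
print(replace_nth("1+1=2, 1+1=2", "1+1", 2))   # "1+1=2, =2"
```

For n = 1 the quantifier becomes {0}, so Group 1 is empty and the first occurrence is the one that gets replaced.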
CodePudding user response:
Another possible solution, using stringr and no regex (it also works for any n):
library(tidyverse)
text <- "abcdabef"
n <- 2
str_locate_all(text, "ab") %>% .[[1]] %>%
when(n <= nrow(.) ~ `str_sub<-`(text, .[n, 1], .[n, 2], value = ""), ~ text)
#> [1] "abcdef"
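The same locate-then-cut idea can also be sketched in base R without any packages. The function name remove_nth_fixed is hypothetical; it relies on gregexpr with fixed = TRUE, so the phrase is matched literally and needs no escaping:

```r
# Hypothetical base-R helper: remove the n-th literal occurrence of phrase
remove_nth_fixed <- function(text, phrase, n) {
  hits <- gregexpr(phrase, text, fixed = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match at all
  if (hits[1] == -1 || n > length(hits)) return(text)
  start <- hits[n]
  end <- start + attr(hits, "match.length")[n] - 1
  # Keep everything before and after the n-th match
  paste0(substr(text, 1, start - 1), substr(text, end + 1, nchar(text)))
}

print(remove_nth_fixed("abcdabef", "ab", 2))  # "abcdef"
print(remove_nth_fixed("abcdabef", "ab", 5))  # "abcdabef" (fewer than 5 matches)
```

As with the stringr version, the text is returned unchanged when it contains fewer than n occurrences.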