R: How to remove one of any two consecutive identical words?

How can I create a function that removes one of any two consecutive identical words (in my case separated by an underscore) without specifying the words?

## Some examples
c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
#> [1] "ethnicity_ethnicity_selected_choice" "child_1_child_child_pid"

## Output needed
c("ethnicity_selected_choice",
  "child_1_child_pid")
#> [1] "ethnicity_selected_choice" "child_1_child_pid"

Created on 2022-07-08 by the reprex package (v2.0.1)

CodePudding user response：

You could try to find:

([^_]+)(?:_\1(?=_|$))*

and replace each match with \1.

- ([^_]+) - A capture group matching one or more non-underscore characters;
- (?:_\1 - A non-capturing group matching an underscore followed by a backreference to the 1st capture group;
  - (?=_|$) - A nested positive lookahead asserting that either an underscore or the end of the line follows, so only whole words count as repeats;
- )* - Close the non-capturing group and match it zero or more times.

library(stringr)
v <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_pid")
# Collapse each run of identical underscore-separated words into one word
v <- str_replace_all(v, "([^_]+)(?:_\\1(?=_|$))*", "\\1")
v

Prints:

#> [1] "ethnicity_selected_choice" "child_1_child_pid"
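
Since the question asks for a function, the same idea can be wrapped in one. The following is a minimal sketch in base R, assuming the pattern is meant to use the quantifiers `+` and `*`; the name `drop_repeated_words` is hypothetical and not part of any package.

```r
# Hypothetical helper (not from the original answers): collapses runs of
# identical underscore-separated words into a single occurrence.
drop_repeated_words <- function(x) {
  # ([^_]+)           one word (one or more non-underscore characters)
  # (?:_\1(?=_|$))*   zero or more repeats of that same whole word
  gsub("([^_]+)(?:_\\1(?=_|$))*", "\\1", x, perl = TRUE)
}

drop_repeated_words(c("ethnicity_ethnicity_selected_choice",
                      "child_1_child_child_pid",
                      "child_1_child_childhood_pid"))
#> [1] "ethnicity_selected_choice"   "child_1_child_pid"
#> [3] "child_1_child_childhood_pid"
```

The third input is left unchanged because the lookahead only accepts a repeat that ends at an underscore or at the end of the string, so `child` followed by `childhood` does not count as a duplicate.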

CodePudding user response：

Another possible solution:

s <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_child_pid", "child_1_child_childhood_pid",
  "child_child")

# (?<=_|^) and (?=_|$) make sure only whole words are compared
gsub("(?<=_|^)(\\w+)(_\\1)+(?=_|$)", "\\1", s, perl = TRUE)

#> [1] "ethnicity_selected_choice"   "child_1_child_pid"          
#> [3] "child_1_child_childhood_pid" "child"
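
A regex is not strictly required. As an alternative sketch (a different technique from the answers above, not a restatement of them), each string can be split on the separator and runs of identical words collapsed with `rle()`. The function name `drop_repeats_rle` and its `sep` argument are hypothetical.

```r
# Hypothetical non-regex helper: split on the separator, keep one word per
# run of identical neighbours, then paste the words back together.
drop_repeats_rle <- function(x, sep = "_") {
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(parts) paste(rle(parts)$values, collapse = sep),
    character(1)
  )
}

s <- c("ethnicity_ethnicity_selected_choice",
  "child_1_child_child_child_pid", "child_1_child_childhood_pid",
  "child_child")

drop_repeats_rle(s)
#> [1] "ethnicity_selected_choice"   "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
```

Because whole words are compared after splitting, partial overlaps such as `child` and `childhood` are never treated as duplicates, and no lookarounds are needed.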