How can I create a function such that one of any two consecutive words (in my case separated by an underscore) is removed without specifying the words?
## Some examples
c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_pid")
#> [1] "ethnicity_ethnicity_selected_choice" "child_1_child_child_pid"
## Output needed
c("ethnicity_selected_choice",
"child_1_child_pid")
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
Created on 2022-07-08 by the reprex package (v2.0.1)
CodePudding user response:
You could try to find:
([^_]+)(?:_\1(?=_|$))*
and replace with \1.
([^_]+)
- A capture group matching 1+ non-underscore characters;
(?:_\1
- A non-capture group matching an underscore and a backreference to the 1st capture group;
(?=_|$)
- A nested positive lookahead for either an underscore or the end-of-line anchor;
)*
- Close the non-capture group and match it 0+ times.
library(stringr)
v <- c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_pid")
v <- str_replace_all(v, "([^_]+)(?:_\\1(?=_|$))*", "\\1")
v
Prints:
[1] "ethnicity_selected_choice" "child_1_child_pid"
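Since the question asks for a function, the same replacement can be wrapped in one. The following is a minimal sketch in base R rather than stringr; the name drop_repeated_words is hypothetical, and perl = TRUE is passed because R's default regex engine does not support lookaheads.

```r
# Hypothetical helper (name is an assumption, not from the thread):
# collapse runs of identical underscore-separated words into one.
drop_repeated_words <- function(x) {
  # ([^_]+)            one whole word (no underscores), captured as \1
  # (?:_\\1(?=_|$))*   any number of "_<same word>" repeats, each of which
  #                    must end at an underscore or at the end of the string
  gsub("([^_]+)(?:_\\1(?=_|$))*", "\\1", x, perl = TRUE)
}

drop_repeated_words(c("ethnicity_ethnicity_selected_choice",
                      "child_1_child_child_pid"))
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
```

The lookahead is what keeps a word such as "childhood" from being treated as a repeat of "child".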
CodePudding user response:
Another possible solution:
s <- c("ethnicity_ethnicity_selected_choice",
"child_1_child_child_child_pid", "child_1_child_childhood_pid",
"child_child")
gsub("(?<=_|^)(\\w+)(_\\1)+(?=_|$)", "\\1", s, perl = TRUE)
#> [1] "ethnicity_selected_choice" "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
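If a regex feels too opaque, the same result can be reached by splitting on the underscore and collapsing runs of equal words. This is a sketch of an alternative technique, not the approach used in the answers above; the function name drop_repeated_words_rle and its sep argument are hypothetical, and it assumes the input has no NA values or trailing separators.

```r
# Hypothetical non-regex alternative: split each string into words,
# keep one word per run of identical neighbours, then glue back together.
drop_repeated_words_rle <- function(x, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(
    parts,
    # rle() groups consecutive equal elements; $values holds one per run
    function(p) paste(rle(p)$values, collapse = sep),
    character(1)
  )
}

drop_repeated_words_rle(c("ethnicity_ethnicity_selected_choice",
                          "child_1_child_child_child_pid",
                          "child_1_child_childhood_pid",
                          "child_child"))
#> [1] "ethnicity_selected_choice"   "child_1_child_pid"
#> [3] "child_1_child_childhood_pid" "child"
```

Because whole words are compared, "child" and "childhood" are never confused, and only adjacent duplicates are removed.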