R: replace or delete part of string after certain combination of letters

I have two datasets, each with around 100 variables whose names are similar apart from minor differences. The variable names in dataset 1 look like CHILD1xxx or child1xxx, and the variable names in dataset 2 look like CHILD2xxx or child2xxx.

For each of the datasets, I want to systematically get rid of the number (i.e. 1 or 2) so that the variable names are all CHILDxxx or childxxx.

I was thinking about using str_replace or str_replace_all but wasn't sure what kind of regular expression I would use to capture the above criteria. I would greatly appreciate any insights on this.

CodePudding user response：

Here's one approach using gsub().

The pattern matches the word "child" at the start of the string (ignoring case), then one or more digits (\\d+ matches a run of adjacent digits, so the number can be of any length), then any combination of characters (or none) after the number. Using capture groups (the parts in parentheses), we return what comes before and after the digits, but not the digits themselves: "\\1\\2".

x <- c("CHILD1xxx", "child2yyy", "Child23hello")
gsub("^(child)\\d+(.*)", "\\1\\2", x, ignore.case = TRUE)

[1] "CHILDxxx"   "childyyy"   "Childhello"

Another approach is to remove every digit, but this is problematic if other numbers appear later in the string, because they will be removed as well.

gsub("\\d", "", x)

[1] "CHILDxxx"   "childyyy"   "Childhello"
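
To illustrate that caveat, here is a minimal sketch using made-up names (y is hypothetical, not from the question) that carry a second number later in the string. It compares removing every digit with a pattern anchored on the "child" prefix.

```r
# Hypothetical names with a second number later in the string
y <- c("CHILD1age10", "child2score3")

# Removing every digit also strips the later numbers
all_digits_removed <- gsub("\\d", "", y)
all_digits_removed
# [1] "CHILDage"   "childscore"

# Anchoring on the "child" prefix removes only the digits right after it
prefix_digits_removed <- sub("^(child)\\d+", "\\1", y, ignore.case = TRUE)
prefix_digits_removed
# [1] "CHILDage10"  "childscore3"
```

If the real variable names never contain other digits, both approaches give the same result.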

CodePudding user response：

To remove a substring from a string, you can conveniently use str_remove, which removes the first match of a pattern. Since the substring to be removed is one or more digits, use \\d+ as the pattern:

library(stringr)
str_remove(x, "\\d+")
[1] "CHILDxxx"   "childyyy"   "Childhello"

Data:

x <- c("CHILD1xxx", "child2yyy", "Child23hello")

EDIT:

If the replacements should be applied to the column (variable) names of a dataframe, you can use str_remove together with rename_with from dplyr. Columns without digits, such as SomeOther, are left unchanged:

library(dplyr)
df %>%
  rename_with(~ str_remove(., "\\d+"))
  CHILDxxx childyyy Childhello SomeOther
1       NA       NA         NA        NA

Data:

df <- data.frame(
  CHILD1xxx = NA,
  child2yyy = NA,
  Child23hello = NA,
  SomeOther = NA
)
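
For the original question, which involves two datasets, the same idea can be sketched in base R without extra packages. The data frames dataset1 and dataset2 and the helper strip_child_number below are hypothetical stand-ins, and the sketch assumes the number always comes directly after a leading "child".

```r
# Hypothetical stand-ins for the two datasets in the question
dataset1 <- data.frame(CHILD1xxx = NA, child1yyy = NA, SomeOther = NA)
dataset2 <- data.frame(CHILD2xxx = NA, child2yyy = NA, SomeOther = NA)

# Hypothetical helper: drop the digits that directly follow a leading
# "child" (any case) and leave every other name untouched
strip_child_number <- function(nms) {
  sub("^(child)\\d+", "\\1", nms, ignore.case = TRUE)
}

names(dataset1) <- strip_child_number(names(dataset1))
names(dataset2) <- strip_child_number(names(dataset2))

# Both datasets should now share the same column names
names(dataset1)
# [1] "CHILDxxx"  "childyyy"  "SomeOther"
identical(names(dataset1), names(dataset2))
# [1] TRUE
```

Anchoring the pattern with ^ means that digits elsewhere in a name, or in columns that do not start with "child", are not touched.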