I have two datasets, each with around 100 variables whose names are similar apart from minor differences. The variable names in dataset 1 look like CHILD1xxx or child1xxx, and the variable names in dataset 2 look like CHILD2xxx or child2xxx.
For each dataset, I want to systematically remove the number (i.e. 1 or 2) so that the variable names are all CHILDxxx or childxxx.
I was thinking about using str_replace or str_replace_all but wasn't sure what kind of regular expression I would use to capture the above criteria. I would greatly appreciate any insights on this.
CodePudding user response:
Here's one approach using gsub().
The pattern matches the word "child" at the start of the string (ignoring case), followed by one or more digits and then any run of characters (or none). \\d+ matches a set of digits next to each other, so the number can be any non-negative integer. Using capture groups (the parts in parentheses), we return what comes before and after the digits, but not the digits themselves: "\\1\\2".
x <- c("CHILD1xxx", "child2yyy", "Child23hello")
gsub("^(child)\\d+(.*)", "\\1\\2", x, ignore.case = TRUE)
[1] "CHILDxxx" "childyyy" "Childhello"
Another approach could be to remove all digits, but this could be problematic if other numbers come up later on in the string.
gsub("\\d", "", x)
[1] "CHILDxxx" "childyyy" "Childhello"
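To illustrate that caveat, here is a small sketch with made-up names (CHILD1age10 and child2score5 are hypothetical, not from the question) that contain a second number later in the string. sub() with \\d+ drops only the first run of digits, whereas gsub() with \\d drops every digit:

```r
# Hypothetical names with a second number later in the string
y <- c("CHILD1age10", "child2score5")

# sub() replaces only the first match, i.e. the first run of digits
first_only <- sub("\\d+", "", y)

# gsub() replaces every match, so all digits are removed
all_digits <- gsub("\\d", "", y)

print(first_only)
print(all_digits)
```

If later numbers carry meaning, the sub() form (or the anchored pattern above) is probably the safer choice.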
CodePudding user response:
To remove a substring from a string, you can conveniently use str_remove. Since the substring to be removed is one or more digits, define \\d+ as the pattern for the removal:
library(stringr)
str_remove(x, "\\d+")
[1] "CHILDxxx" "childyyy" "Childhello"
Data:
x <- c("CHILD1xxx", "child2yyy", "Child23hello")
EDIT:
If the replacements should be applied to the column (variable) names of a data frame, you could use str_remove together with rename_with:
library(dplyr)
df %>%
  rename_with(~str_remove(., "\\d+"))
  CHILDxxx childyyy Childhello SomeOther
1       NA       NA         NA        NA
Data:
df <- data.frame(
  CHILD1xxx = NA,
  child2yyy = NA,
  Child23hello = NA,
  SomeOther = NA
)
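For the original question (two datasets), a base R sketch along the same lines might look like this. The data frames d1 and d2 and the helper strip_child_number are hypothetical stand-ins, not names from the question. The pattern is anchored on "child", so columns that do not start with it keep their digits:

```r
# Hypothetical stand-ins for the two datasets in the question
d1 <- data.frame(CHILD1xxx = NA, child1yyy = NA, SomeOther1 = NA)
d2 <- data.frame(CHILD2xxx = NA, child2yyy = NA, SomeOther2 = NA)

# Drop the digits that directly follow a leading "child" (any case);
# names that do not start with "child" are left untouched
strip_child_number <- function(d) {
  names(d) <- sub("^(child)\\d+", "\\1", names(d), ignore.case = TRUE)
  d
}

d1 <- strip_child_number(d1)
d2 <- strip_child_number(d2)

print(names(d1))
print(names(d2))
```

Unlike an unanchored "\\d+", this leaves a column such as SomeOther1 as it is, which may or may not be what you want.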