I am trying to crack a seemingly simple problem in R, but I cannot find a way to do it with gsub, str_match() or other regex-related functions. Can anyone please help me with this?
Problem: Assume I have a character vector of some length (say, 100). Each element has the form [string]_[string number]_[someinfo]. I want to extract only the first part of each element, namely [string]_[string number]. The number of characters in [string]_[string number], not counting the _, varies from element to element, anywhere between 8 and 20, so there is no fixed length. How can I use a regex to do this in R?
x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')
Desired output.
x1 = c('XY_ABCD101', 'XZ_ACC122', 'XT_AAEEE100', 'XKY_BBAAUUU124')
CodePudding user response:
An option with str_remove
library(stringr)
str_remove(x, "_\\d+.*")
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
CodePudding user response:
We could use str_extract from the stringr package with a regex that matches everything before the second underscore:
library(stringr)
str_extract(x, "[^_]*_[^_]*")
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
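For readers without stringr, the same idea (keep everything before the second underscore) can be sketched in base R with sub and a capture group. This is only a minimal sketch, not part of the answer above: the variable name first_two is made up for illustration, and elements that contain no underscore at all would be returned unchanged.

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# Capture "non-underscores, underscore, non-underscores" from the start of
# each string, then replace the whole string with just that captured group.
first_two <- sub("^([^_]*_[^_]*).*$", "\\1", x)
print(first_two)
```

Anchoring with ^ and consuming the rest with .*$ means the replacement keeps only the captured prefix.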
CodePudding user response:
library(stringr)
str_extract(x, "[:alnum:]+_[:alnum:]+(?=_)")
[1] "XY_ABCD101" "XZ_ACC122"
[3] "XT_AAEEE100" "XKY_BBAAUUU124"
CodePudding user response:
Try this
regmatches(x, regexpr("\\D+_\\D+\\d+", x))
- Output
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100"
[4] "XKY_BBAAUUU124"
CodePudding user response:
Since your intended output strings always end with a digit that is immediately followed by _, you can use the pattern (?<=\\d)(?=_) to find that position and remove the characters that follow:
> gsub("(?<=\\d)(?=_).*$","",x,perl = TRUE)
[1] "XY_ABCD101" "XZ_ACC122" "XT_AAEEE100" "XKY_BBAAUUU124"
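To see what the zero-width pattern matches on its own, one could locate the position with regexpr and cut the string there with substr. This is a minimal sketch under assumptions: the names cut_at and prefix are hypothetical, and every element is assumed to contain a digit immediately followed by an underscore (regexpr returns -1 where it does not).

```r
x <- c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

# The lookbehind/lookahead pair consumes no characters; regexpr reports the
# index of the underscore that directly follows the first such digit.
cut_at <- as.integer(regexpr("(?<=\\d)(?=_)", x, perl = TRUE))

# Keep everything up to, but not including, that underscore.
prefix <- substr(x, 1, cut_at - 1)
print(prefix)
```

The gsub call above does the same thing in one step by matching from that position to the end of the string and replacing it with an empty string.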