Extract part of the strings with specific format

I am trying to solve a seemingly simple problem in R, but I cannot find a way to do it with gsub(), str_match(), or other regex-related functions. Can anyone please help me with this?

Problem: Assume that I have a character vector of a certain length (say, 100). Each element of the vector has the form [string]_[string number]_[someinfo]. I want to extract only the first part of each element, namely [string]_[string number]. The number of characters in [string]_[string number], not including _, could be anywhere between 8 and 20, so there is no fixed length. How can I use a regular expression to do this in R?

x = c('XY_ABCD101_12_ACE', 'XZ_ACC122_100_BAN', 'XT_AAEEE100_12345_ABC', 'XKY_BBAAUUU124_100')

Desired output:

x1 = c('XY_ABCD101', 'XZ_ACC122', 'XT_AAEEE100', 'XKY_BBAAUUU124')

CodePudding user response：

An option with str_remove

library(stringr)
str_remove(x, "_\\d+.*")
[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"

CodePudding user response：

We could use str_extract from the stringr package with a regex that matches everything before the second underscore:

library(stringr)
str_extract(x, "[^_]*_[^_]*")

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
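
If you would rather not depend on stringr, the same idea can probably be written in base R. This is only a sketch: the variable name first_two is made up for illustration, and it assumes every element contains at least one underscore (elements without one are returned unchanged by sub()).

```r
# Sample input from the question
x <- c("XY_ABCD101_12_ACE", "XZ_ACC122_100_BAN",
       "XT_AAEEE100_12345_ABC", "XKY_BBAAUUU124_100")

# Capture "<no underscores>_<no underscores>" at the start of the string
# and replace the whole string with just that captured group.
first_two <- sub("^([^_]*_[^_]*).*$", "\\1", x)
print(first_two)
```

The capture group does the same job as the str_extract pattern; the trailing .*$ only exists so that sub() consumes, and therefore drops, the rest of the string.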

CodePudding user response：

library(stringr)
str_extract(x, "[:alnum:]+_[:alnum:]+(?=_)")

[1] "XY_ABCD101"     "XZ_ACC122"     
[3] "XT_AAEEE100"    "XKY_BBAAUUU124"

CodePudding user response：

Try this

regmatches(x, regexpr("\\D+_\\D+\\d+", x))

Output

[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"   
[4] "XKY_BBAAUUU124"
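
One thing to watch with regmatches() plus regexpr(): elements that do not match are dropped, so the result can be shorter than the input. If you need the output aligned with a data frame column, something like the following may help. It is a sketch only; the "no match here" element and the names m and out are made up for illustration.

```r
# Two strings from the question plus one hypothetical non-matching element
x <- c("XY_ABCD101_12_ACE", "no match here", "XKY_BBAAUUU124_100")

# regexpr() returns -1 for elements where the pattern is not found
m <- regexpr("\\D+_\\D+\\d+", x)

# Start from all-NA and fill in only the positions that matched,
# so the result has the same length and order as x.
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
print(out)
```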

CodePudding user response：

Since your intended output strings always end with a digit that is immediately followed by _, you can use the pattern (?<=\\d)(?=_) to find that position and remove the characters that follow it

> gsub("(?<=\\d)(?=_).*$","",x,perl = TRUE)
[1] "XY_ABCD101"     "XZ_ACC122"      "XT_AAEEE100"    "XKY_BBAAUUU124"
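
To see what the lookbehind/lookahead pair actually locates, you can ask regexpr() for the position of the zero-length match and cut the string there. This is just an illustrative sketch (pos and result are made-up names), and it assumes every element contains such a position.

```r
# Sample input from the question
x <- c("XY_ABCD101_12_ACE", "XZ_ACC122_100_BAN",
       "XT_AAEEE100_12345_ABC", "XKY_BBAAUUU124_100")

# (?<=\\d) requires a digit just before the position, (?=_) an underscore
# just after it; the match itself has length zero. Lookarounds need perl = TRUE.
pos <- as.integer(regexpr("(?<=\\d)(?=_)", x, perl = TRUE))

# pos is the index of the underscore, so keep everything before it
result <- substr(x, 1, pos - 1L)
print(result)
```

The gsub() call above does the same thing in one step: it anchors at that position and deletes everything from there to the end of the string.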