Extracting string after a specific pattern in R

Time:03-22

I want to extract substrings from a vector of identifiers of different lengths. Essentially, I want to keep every character of each identifier up to the third occurrence of "-", plus the digits that follow it, but not the letter after those digits, and remove the rest. An example vector is below:

mylist <- c("abc-nop-7a-2","abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")

I want the resulting list to look like:

result <- c("abc-nop-7","abc-nop-7", "abc-nop-18", "abc-xyz-198")

I have tried splitting the strings and then taking the section I want, but I was not sure how to call sections up to a certain point. I tried:

mylist <- gsub("-", "_", mylist) # "-" was not acceptable as a character
mylist <- strsplit(mylist, "_")
sapply(mylist, `[`, 3)

But of course, the above only gives me something like this:

"7","7", "18", "198"

Is there a way to extract sections 1 to 3 of the pieces I split with the method above? If there is a more efficient way to do this with stringr or something similar, I'd appreciate that as well.

Thanks in advance.

CodePudding user response:

We can capture the part to keep as a group and replace the whole string with the backreference (\\1):

sub("^(([^-]+-){2}[0-9]+).*", "\\1", mylist)
[1] "abc-nop-7"   "abc-nop-7"   "abc-nop-18"  "abc-xyz-198"

The pattern matches, from the start (^) of the string, two ({2}) instances of one or more characters that are not a - ([^-]+) followed by a -, then one or more digits ([0-9]+). That whole part is captured ((...)), and the trailing .* matches the rest of the string. The replacement is the backreference to the captured group, so everything after the digits is dropped.
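For comparison, here is a minimal base R sketch of two other routes to the same result: the split-and-rejoin approach the question was attempting, and a direct extraction of the matching prefix. The variable names (parts, first3, by_split, by_match) are made up for illustration, and the sketch assumes every identifier has at least three "-"-separated pieces with the third one starting with digits, as in the example data.

```r
mylist <- c("abc-nop-7a-2", "abc-nop-7b-3p", "abc-nop-18a-5p/18c-5p", "abc-xyz-198_5p")

# Route 1: split, keep the first three pieces, rejoin.
# Splitting on "-" directly works here, so no "_" substitution is needed.
parts  <- strsplit(mylist, "-", fixed = TRUE)
first3 <- sapply(parts, function(p) paste(head(p, 3), collapse = "-"))

# The third piece may still carry a suffix ("7a", "198_5p"), so keep only
# the digits that follow the last "-" and drop whatever comes after them.
by_split <- sub("^(.*-[0-9]+).*$", "\\1", first3)

# Route 2: extract the matching prefix instead of replacing the rest.
# Note: regmatches() silently drops elements that do not match (assumption:
# every identifier matches, as in the example data).
by_match <- regmatches(mylist, regexpr("^([^-]+-){2}[0-9]+", mylist))

print(by_split)
print(by_match)
```

Both routes should give the same vector as the sub() call above. The single sub() call is the shortest, while the split version may be easier to adapt if the number of sections to keep changes.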
