REGEX to extract a string after an underscore up to a final mark in R

Time:10-26

I'm sure this is a silly question. I have a couple of strings such as data_PB_Belf.csv and I need to extract only PB_Belf (and so on). How can I extract everything after the first _ up to the . (preferably using stringr)?

data
[1] "data_PB_Belf.csv" "data_PB_NI.csv" ...

str_replace(data[1], "^[^_]+_([^_]+)_.*", "\\1") ## the closest I got, it returns "PB"
I tried to adapt the code from here, but I wasn't able to. I'm sure that there's a way to use str_replace() or str_sub() or str_extract(), I just can't get the right regex. Thanks in advance!

CodePudding user response:

We can match one or more characters that are not a _ ([^_]+) from the start (^) of the string, followed by a _, then capture one or more characters that are not a dot (([^.]+)), followed by a literal dot (. is a metacharacter, so escape it as \\.), followed by any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group.

sub("^[^_]+_([^.]+)\\..*", "\\1", data)
[1] "PB_Belf" "PB_NI" 

Or with str_replace

library(stringr)
str_replace(data, "^[^_]+_([^.]+)\\..*", "\\1")
[1] "PB_Belf" "PB_NI" 
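One thing worth knowing about the replace-based approach: sub() returns a string unchanged when the pattern does not match it. The following is a minimal sketch in base R, assuming a hypothetical third file name ("nounderscore.csv") that is not part of the original data; the files, pattern and result names are illustrative only.

```r
# File names from the question, plus one hypothetical name without an underscore
files <- c("data_PB_Belf.csv", "data_PB_NI.csv", "nounderscore.csv")

# Same pattern as above: prefix, underscore, captured part, dot, rest
pattern <- "^[^_]+_([^.]+)\\..*"

# sub() swaps the whole match for capture group 1;
# a string that does not match the pattern is returned as-is
result <- sub(pattern, "\\1", files)
print(result)

# grepl() flags which inputs matched, so unmatched names can be set to NA
result[!grepl(pattern, files)] <- NA_character_
print(result)
```

If silent pass-through of unmatched names is acceptable, the grepl() step can be dropped.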

CodePudding user response:

There are other, simpler, options available too. For example, you can use, as you mentioned, str_extract in conjunction with lookarounds:

library(stringr)
str_extract(x, "(?<=_).*?(?=\\.)")
[1] "PB_Belf" "PB_NI"

Here we are using:

  • (?<=_): positive lookbehind to assert that what we want to extract must be preceded by _,
  • .*?: a lazy match of as few characters as possible, so the match stops at the first dot, and
  • (?=\\.): positive lookahead to assert that what we want to extract must be followed by a dot.
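For anyone who prefers to avoid the stringr dependency, the same lookaround pattern can probably be used in base R as well. This is a sketch rather than part of the original answer: it relies on regexpr() with perl = TRUE (lookarounds need the PCRE engine) and regmatches(); the names m and extracted are illustrative.

```r
x <- c("data_PB_Belf.csv", "data_PB_NI.csv")

# perl = TRUE switches to PCRE, which supports lookbehind and lookahead
m <- regexpr("(?<=_).*?(?=\\.)", x, perl = TRUE)

# regmatches() returns the matched substrings;
# elements of x without a match are dropped from the result
extracted <- regmatches(x, m)
print(extracted)
```

Note the difference from str_extract(), which returns NA for elements that do not match instead of dropping them, so the base R result can be shorter than the input.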

Data:

x <- c("data_PB_Belf.csv", "data_PB_NI.csv")