I'm sure this is a silly question. I have a couple of strings such as data_PB_Belf.csv and I need to extract only PB_Belf (and so on). How can I extract everything after the first _ up to the . (preferably using stringr)?
data
[1] "data_PB_Belf.csv" "data_PB_NI.csv" ...
str_replace(data[1], "^[^_]+_([^_]+)_.*", "\\1") ## the closest I got; it returns "PB"
I tried to adapt the code from here, but I wasn't able to. I'm sure that there's a way to use str_replace() or str_sub() or str_extract(), I just can't get the right regex. Thanks in advance!
CodePudding user response:
We can match one or more characters that are not a _ ([^_]+) from the start (^) of the string, followed by a _, then capture one or more characters that are not a dot (([^.]+)), followed by a literal dot (the dot is a regex metacharacter, so escape it as \\.), followed by any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group.
sub("^[^_]+_([^.]+)\\..*", "\\1", data)
[1] "PB_Belf" "PB_NI"
Or with str_replace
library(stringr)
str_replace(data, "^[^_]+_([^.]+)\\..*", "\\1")
[1] "PB_Belf" "PB_NI"
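If you would rather pull out the captured group directly instead of replacing around it, str_match() from stringr is one option. This is only a sketch: the third element of data below is a hypothetical string with no underscore, added to show what happens when nothing matches.

```r
library(stringr)

# The third element is a hypothetical non-matching string, added for illustration
data <- c("data_PB_Belf.csv", "data_PB_NI.csv", "no-underscore")

# Same capture group as above; the trailing .* is not needed when only extracting
pattern <- "^[^_]+_([^.]+)\\."

# str_match() returns a character matrix: column 1 is the full match,
# column 2 is the first capture group
extracted <- str_match(data, pattern)[, 2]
extracted
# [1] "PB_Belf" "PB_NI"   NA

# sub() leaves a non-matching string unchanged rather than returning NA
replaced <- sub("^[^_]+_([^.]+)\\..*", "\\1", data)
replaced
# [1] "PB_Belf"       "PB_NI"         "no-underscore"
```

The difference matters if some file names might not follow the pattern: with str_match() you get an NA you can check for, while with sub() the original string passes through silently.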
CodePudding user response:
There are other, simpler options available too. For example, as you mentioned, you can use str_extract in conjunction with lookarounds:
library(stringr)
str_extract(x, "(?<=_).*?(?=\\.)")
[1] "PB_Belf" "PB_NI"
Here we are using:
(?<=_): a positive lookbehind to assert that what we want to extract must be preceded by _,
.*?: a lazy match that takes as few characters as possible, so the match ends at the first dot, and
(?=\\.): a positive lookahead to assert that what we want to extract must be followed by a dot.
Data:
x <- c("data_PB_Belf.csv", "data_PB_NI.csv")
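For completeness, the same lookaround pattern should also work without stringr. Here is a minimal base R sketch, assuming perl = TRUE is set so that the lookbehind is supported; the variable names m and result are only for illustration.

```r
x <- c("data_PB_Belf.csv", "data_PB_NI.csv")

# regexpr() finds the first match per string; perl = TRUE enables lookarounds
m <- regexpr("(?<=_).*?(?=\\.)", x, perl = TRUE)

# regmatches() pulls out the matched substrings
result <- regmatches(x, m)
result
# [1] "PB_Belf" "PB_NI"
```

Note that regmatches() drops elements that have no match, so the result can be shorter than the input, whereas str_extract() returns NA in those positions.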