Regex help: subset based on the nth occurrence from the end of a string

An example vector:

string <- "Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf"

I am trying to extract ID1 by counting underscores (_) from the right, i.e. take the text between the second and third underscore from the right.

expected output: ID1

My attempt relied on the double underscore (__), but that will not work in general because not every string in my list contains it.

Attempt: (_){2}([^_]+)

Side note, I am trying to get comfortable with regex; please recommend a resource to build and test.

Any assistance is appreciated.

CodePudding user response：

You can use

library(stringr)
stringr::str_extract(string, "[^_]+(?=(?:_[^_]*){2}$)")

Or, same approach with base R:

## Base R:
sub(".*?([^_]+)(?:_[^_]*){2}$", "\\1", string)

Details:

[^_]+ - one or more chars other than _
(?=(?:_[^_]*){2}$) - a positive lookahead that requires two sequences of _ followed by zero or more repetitions of any char other than _, up to the end of the string
.*?([^_]+)(?:_[^_]*){2}$ matches:
- .*? - any zero or more chars, as few as possible
- ([^_]+) - Capturing group 1 (\1 in the replacement pattern refers to this captured string): one or more chars other than _
- (?:_[^_]*){2} - two sequences of _ followed by zero or more repetitions of any char other than _
- $ - end of string.
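
Since the question mentions that not every string contains the double underscore, it may help to check both patterns against a few inputs. The following is a minimal sketch in base R only; the second and third strings are made-up test cases (not from the question), and perl = TRUE is passed explicitly here because the lookahead requires it with regexpr().

```r
# Only the first string comes from the question; the others are hypothetical.
strings <- c(
  "Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf",
  "Junk1_ID2_Junk4_Junk5.pdf",
  "ID3_Junk4_Junk5.pdf"
)

# Lookahead version: extract the first run of non-underscore chars
# that is followed by exactly two "_..." segments up to the end.
lookahead <- regmatches(
  strings,
  regexpr("[^_]+(?=(?:_[^_]*){2}$)", strings, perl = TRUE)
)

# Capture-group version: replace the whole string with group 1.
captured <- sub(".*?([^_]+)(?:_[^_]*){2}$", "\\1", strings, perl = TRUE)

print(lookahead)  # "ID1" "ID2" "ID3"
print(captured)   # "ID1" "ID2" "ID3"
```

Neither pattern depends on the double underscore, because both count segments from the end of the string.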

CodePudding user response：

sub(".*_([^_]+)(_[^_]+){2}", "\\1", string)
[1] "ID1"
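
One thing to keep in mind with this pattern is that it requires an underscore before the ID, so a string whose ID is the very first field would be returned unchanged. As a non-regex alternative (a different technique from the answers above), the string can be split on underscores and indexed from the right. The sketch below assumes a hypothetical helper name, third_from_right, and a made-up second test string.

```r
# Hypothetical helper (not from the answers above):
# split on "_" and take the third field counting from the right.
third_from_right <- function(x) {
  vapply(strsplit(x, "_", fixed = TRUE), function(parts) {
    n <- length(parts)
    # Fewer than three fields: there is nothing to extract.
    if (n < 3) NA_character_ else parts[n - 2]
  }, character(1))
}

# The second string is a made-up case where the ID is the first field.
strings <- c("Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf", "ID3_Junk4_Junk5.pdf")

split_result <- third_from_right(strings)
print(split_result)  # "ID1" "ID3"

# The greedy pattern finds no match in the second string,
# so sub() returns that string unchanged.
greedy_result <- sub(".*_([^_]+)(_[^_]+){2}", "\\1", strings)
print(greedy_result)
```

Note that strsplit() drops a trailing empty field, so strings ending in an underscore would need separate handling.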