Regex help, subset based on nth occurrence from the end of string

Time:11-16

An example vector:

string <- "Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf"

I am trying to extract ID1 by counting underscores (_) from the right, i.e. the text between the second and third underscore from the right.

expected output: ID1

My attempt was to anchor on the double underscore (__), but that is not going to work, because not every string in my list contains it.

Attempt: (_){2}([^_]+)

Side note, I am trying to get comfortable with regex; please recommend a resource to build and test.

Any assistance is appreciated.

CodePudding user response:

You can use

library(stringr)
stringr::str_extract(string, "[^_]+(?=(?:_[^_]*){2}$)")

Or, same approach with base R:

## Base R:
sub(".*?([^_]+)(?:_[^_]*){2}$", "\\1", string)


Details:

  • [^_]+ - one or more chars other than _
  • (?=(?:_[^_]*){2}$) - a positive lookahead that requires two sequences of _ followed by zero or more chars other than _, up to the end of the string
  • .*?([^_]+)(?:_[^_]*){2}$ matches
    • .*? - any zero or more chars, as few as possible
    • ([^_]+) - Capturing group 1 (\1 in the replacement pattern refers to this captured string): one or more chars other than _
    • (?:_[^_]*){2} - two sequences of _ followed by zero or more chars other than _
    • $ - end of string.
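
To check the breakdown above against more than one input, here is a minimal base R sketch. Only the first string comes from the question; the other two are hypothetical inputs made up to cover the case where there is no double underscore. Passing perl = TRUE to both calls is a choice made here, not part of the answer: the lookahead needs it in base R, and it keeps both patterns on the same engine.

```r
# Only the first string is from the question; the rest are hypothetical.
strings <- c(
  "Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf",  # double underscore before the ID
  "Junk1_ID2_Junk4_Junk5.pdf",               # single underscores only
  "ID3_Junk4_Junk5.pdf"                      # ID is the first segment
)

# Capture-group form: replace the whole match with group 1.
ids_sub <- sub(".*?([^_]+)(?:_[^_]*){2}$", "\\1", strings, perl = TRUE)

# Lookahead form: extract the match itself (lookaheads need perl = TRUE).
pattern_look <- "[^_]+(?=(?:_[^_]*){2}$)"
ids_look <- regmatches(strings, regexpr(pattern_look, strings, perl = TRUE))

print(ids_sub)   # "ID1" "ID2" "ID3"
print(ids_look)  # "ID1" "ID2" "ID3"
```

Both forms count underscores from the end of the string, so the result does not depend on whether a double underscore is present.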

CodePudding user response:

sub(".*_([^_]+)(_[^_]+){2}", "\\1", string)
[1] "ID1"
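
One thing to keep in mind with this pattern: sub() returns its input unchanged when nothing matches, and the leading .*_ requires at least one underscore before the ID. The sketch below, using hypothetical inputs beyond the question's string, flags non-matching strings as NA rather than passing them through silently. The grepl() guard is an addition for illustration, not part of the answer.

```r
# Pattern from the answer above: greedy .*_ followed by the ID segment
# and two trailing "_segment" groups.
pattern <- ".*_([^_]+)(_[^_]+){2}"

strings <- c(
  "Junk1_Junk2_Junk3__ID1_Junk4_Junk5.pdf",  # from the question
  "ID3_Junk4_Junk5.pdf",                     # hypothetical: no "_" before the ID
  "nounderscores.pdf"                        # hypothetical: no "_" at all
)

# Without a guard, unmatched strings come back as-is.
raw <- sub(pattern, "\\1", strings)

# With a guard, unmatched strings become NA.
matched <- grepl(pattern, strings)
ids <- ifelse(matched, raw, NA_character_)

print(raw)  # "ID1" "ID3_Junk4_Junk5.pdf" "nounderscores.pdf"
print(ids)  # "ID1" NA NA
```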