Regex substring from last underscore occurrence till nth underscore (from the back)

I am trying to subset a string from the back based on occurrence of underscores. Example follows:

string <- "trash_trash_trash_keep_keep_keep_trash.trash"

I want the substring that ends at the last underscore and starts at the nth underscore counted from the back. In this example, the desired output is:

"keep_keep_keep"

My attempt so far does not work: '^(?:[^_]*){3}(.+)_'. I think I should approach the problem from the back of the string instead of from the start.

Any input is appreciated.

CodePudding user response：

I think this regex should do it. It uses a positive lookbehind and a positive lookahead to isolate the correct fragment:

string <- "trash_trash_trash_keep_keep_keep_trash.trash" 

stringr::str_extract(string, "(?<=^([^_]{0,999}_){3}).+(?=_[^_]*$)")

# [1] "keep_keep_keep"

You can change the 3 in the regex to set how many underscore-delimited fields are skipped from the front of the string.
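
Since the question asks to count from the back, here is a rough sketch of the same idea anchored at the end of the string instead, using base R with perl = TRUE. The function name extract_before_last is made up for illustration and is not part of the answer above. It assumes the string has at least n fields before the last underscore; otherwise it returns character(0).

```r
# Hypothetical helper: extract the n underscore-delimited fields that
# precede the last underscore, using a regex anchored at the end.
extract_before_last <- function(x, n) {
  # (?:[^_]*_){n-1}[^_]*  -> n fields joined by n - 1 underscores
  # (?=_[^_]*$)           -> must be followed by the last underscore
  pattern <- sprintf("(?:[^_]*_){%d}[^_]*(?=_[^_]*$)", n - 1L)
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

string <- "trash_trash_trash_keep_keep_keep_trash.trash"

extract_before_last(string, 3)
# [1] "keep_keep_keep"

extract_before_last(string, 2)
# [1] "keep_keep"
```

Here only n needs to change, and the number of leading fields does not matter.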

CodePudding user response：

This is similar to @Kat's approach in the comment but using a function to make it dynamic.

string <- "trash_trash_trash_keep_keep_keep_trash.trash" 

return_last_n_words <- function(x, n) {
  strsplit(x, '_')[[1]] |> head(-1) |> tail(n) |> paste0(collapse = "_")
}

return_last_n_words(string, 3)
#[1] "keep_keep_keep"

return_last_n_words(string, 4)
#[1] "trash_keep_keep_keep"

return_last_n_words(string, 2)
#[1] "keep_keep"

The idea is to split the string on underscores (_), drop the last part, select the last n words, and paste them back into one string.
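
Note that the function above only looks at the first element of x because of the [[1]]. If you need it for a whole character vector, something along these lines might work. It is only a sketch, and last_n_fields is a made-up name:

```r
# Hypothetical vectorised variant: applies the same split / drop / tail /
# paste steps to every element of x, not only the first one.
last_n_fields <- function(x, n) {
  vapply(
    strsplit(x, "_", fixed = TRUE),
    function(parts) paste(tail(head(parts, -1), n), collapse = "_"),
    character(1)
  )
}

strings <- c(
  "trash_trash_trash_keep_keep_keep_trash.trash",
  "a_b_c_d.txt"
)

last_n_fields(strings, 2)
# [1] "keep_keep" "b_c"
```

fixed = TRUE makes strsplit treat the underscore as a literal string rather than a regex, which is all that is needed here.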