How to extract part of a string matching a pattern between separators in R

I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.

Example file names:

a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"

I'm trying to pull out the parts that start with "model" so I end up with

"modelExisting" "modelDesign"

The file names are stored as a variable in a data.frame. I've tried

library(tidyverse)
tibble(files = c(a,b))%>%
  mutate(attempt1 = str_extract(files, "model"),
         attempt2 = str_match(str_split(files, "_"), "model"))

but this just returned "model" in every case, not the full "model..." piece that I need.

The pieces I need are a consistent number of pieces from the end, but I couldn't figure out how to specify that either. I tried

str_split(files, "_")[-3]

but this threw an error saying it must be size 480 or 1, not size 479.

CodePudding user response：

We can create a function that captures the word that comes after a _ and before a _ followed by one or more digits, and specify the backreference (\\1) to that captured group as the replacement:

f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)

Testing:

> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
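
Since sub() is vectorised, the same function can be applied to a whole data frame column in one call. A minimal sketch, assuming the file names live in a column called files (the data frame df and the new column model are illustrative names, not from the question):

```r
# Example file names from the question
a <- "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b <- "Depth_modelDesign_11000cfs_blah2.tif"

# Capture the letters between "_" and "_<digits>", return only the capture
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)

# df and the column name `model` are hypothetical
df <- data.frame(files = c(a, b), stringsAsFactors = FALSE)
df$model <- f1(df$files)
df$model
```

Note that sub() returns its input unchanged when the pattern does not match, so a file name without such a piece would come back whole rather than as NA.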

CodePudding user response：

We can use strsplit or regmatches like below

> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")

> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"

[[2]]
[1] "modelDesign"
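
If the wanted piece is always the same number of parts from the end (third from last in both examples), the split result can also be indexed from the end. In the question, str_split(files, "_")[-3] dropped the third element of the list of split results (hence size 479 instead of 480) rather than indexing inside each element. A sketch, assuming the third-from-last position holds for every file name (third_from_end is an illustrative name):

```r
s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
       "Depth_modelDesign_11000cfs_blah2.tif")

# Split each name on "_" and take the third piece from the end;
# vapply returns a plain character vector instead of a list
third_from_end <- vapply(strsplit(s, "_"),
                         function(x) x[length(x) - 2],
                         character(1))
third_from_end
```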


> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"

[[2]]
[1] "modelDesign"
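
As the wanted piece always starts with "model", matching that prefix directly is another option. A sketch using regexpr (first match only), so that regmatches returns a character vector rather than a list; it assumes the piece itself never contains an underscore (model_part is an illustrative name):

```r
s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
       "Depth_modelDesign_11000cfs_blah2.tif")

# "model" followed by everything up to the next "_"
m <- regexpr("model[^_]+", s)
model_part <- regmatches(s, m)
model_part
```

Keep in mind that regmatches() drops elements with no match here, so the result can be shorter than s. In the question's tidyverse pipeline, str_extract(files, "model[^_]+") would be the equivalent and returns NA for non-matching names instead.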