I'm trying to extract part of a file name that matches a set of letters with variable length. The file names consist of several parameters separated by "_", but they vary in the number of parts. I'm trying to pull some of the parameters out to use separately.
Example file names:
a = "Vel_Mag_ft_modelExisting_350cfs_blah3.tif"
b = "Depth_modelDesign_11000cfs_blah2.tif"
I'm trying to pull out the parts that start with "model" so I end up with
"modelExisting" "modelDesign"
The filenames are stored as a variable in a data.frame. I've tried
library(tidyverse)
tibble(files = c(a,b))%>%
mutate(attempt1 = str_extract(files, "model"),
attempt2 = str_match(str_split(files, "_"), "model"))
but just ended up with the "model" in all cases and not the "model...." that I need.
The pieces I need are a consistent number of pieces from the end, but I couldn't figure out how to specify that either. I tried
str_split(files, "_")[-3]
but this threw an error saying it must be size 480 or 1, not size 479.
CodePudding user response:
We can create a function that captures the word just before a _ followed by one or more digits (\\d+). In the replacement, specify the backreference (\\1) of the captured group
f1 <- function(x) sub(".*_([[:alpha:]]+)_\\d+.*", "\\1", x)
-testing
> f1(a)
[1] "modelExisting"
> f1(b)
[1] "modelDesign"
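Since the question uses stringr, the original str_extract attempt can also be adjusted directly. This is only a minimal sketch, assuming the wanted piece always begins with the literal text "model" and never contains an underscore itself; the names files and model_part are made up for illustration.

```r
library(stringr)

files <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
           "Depth_modelDesign_11000cfs_blah2.tif")

# Match "model" plus every following character up to the next underscore.
# The bare pattern "model" only returns the literal text "model".
model_part <- str_extract(files, "model[^_]*")
model_part
```

Inside mutate() the same call would take the column name in place of files.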
CodePudding user response:
We can use strsplit or regmatches, like below
> s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif", "Depth_modelDesign_11000cfs_blah2.tif")
> lapply(strsplit(s, "_"), function(x) x[which(grepl("^\\d+", x)) - 1])
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
> regmatches(s, gregexpr("[[:alpha:]]+(?=_\\d+)", s, perl = TRUE))
[[1]]
[1] "modelExisting"
[[2]]
[1] "modelDesign"
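The question also mentions that the wanted piece sits a fixed number of pieces from the end. Indexing the result of a split with [-3] drops the third element of the list (that is, the third file) rather than picking a piece within each file name, which would explain the size going from 480 to 479. A minimal base R sketch of per-file indexing, assuming the piece is always third from the end; the names pieces and third_from_end are hypothetical:

```r
s <- c("Vel_Mag_ft_modelExisting_350cfs_blah3.tif",
       "Depth_modelDesign_11000cfs_blah2.tif")

# Split each file name on "_"; the result is a list with one character vector per file.
pieces <- strsplit(s, "_", fixed = TRUE)

# Within each vector, take the element two positions before the last one.
third_from_end <- vapply(pieces, function(x) x[length(x) - 2], character(1))
third_from_end
```

vapply returns a plain character vector of the same length as s, so the result can be assigned to a data.frame column directly.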