I have multiple .xlsx files in a folder with different name formatting. I would like to mutate a new column in R into the following format: YYMMDD_wk#. Below is a list of three files with different name formatting. The only difference is some of the weeks are formatted as decimals such as wk2.1
1.) 220128_B4_Height_performance_wk1
2.) 220204_SQ_Height_rerun_performance_plateD_wk44
3.) 220208_B4_Height_performance_noWeight_wk2.1
All I want is the the first and last parts of the file combined like YYMMDD_wk#. The code I have is giving me a NA for the new column.
performance <- list.file(path ="/Users/Desktop/performance", pattern= "*.xlsx", full.names=T) %>%
setNames(nm = .) %>%
lapply(read_excel) %>%
bind_rows(.id = "Week") %>%
mutate(Week = as.factor(str_extract(Week, "(\\d )_wk(\\d \\.?\\d*)")))
CodePudding user response:
You can use this code:
performance <- list("220128_B4_Height_performance_wk1", "220204_SQ_Height_rerun_performance_plateD_wk44" ,"220208_B4_Height_performance_noWeight_wk2.1")
First select everything before the first _
:
first <- sub("\\_.*", "", performance)
Output:
[1] "220128" "220204" "220208"
Second select everything after the last _
:
second <- sub('. _(. )', '\\1', performance)
Output:
[1] "wk1" "wk44" "wk2.1"
And finally combine first and second part using this code:
paste(first, second, sep = "_")
Output:
[1] "220128_wk1" "220204_wk44" "220208_wk2.1"
CodePudding user response:
Using stringr::str_replace_all()
:
library(stringr)
# example data with directories and file exts
Week <- c(
"dir\\220128_B4_Height_performance_wk1.xlsx",
"dir\\220204_SQ_Height_rerun_performance_plateD_wk44.xlsx",
"dir\\220208_B4_Height_performance_noWeight_wk2.1.xlsx"
)
str_replace_all(basename(Week), "(\\d*)_.*(_wk\\d*)\\..*", "\\1\\2")
#> [1] "220128_wk1" "220204_wk44" "220208_wk2"
Created on 2022-03-14 by the reprex package (v2.0.1)
Explanation:
basename()
removes the path before the filename(\\d*)_.*
captures everything before the first underscore(_wk\\d*)\\..*
captures the _wk# through the first period"\\1\\2"
retains just those two captured groups.