Home > front end >  Mutating new column by extracting parts of a file name in R
Mutating new column by extracting parts of a file name in R

Time:03-15

I have multiple .xlsx files in a folder with different name formatting. I would like to mutate a new column in R into the following format: YYMMDD_wk#. Below is a list of three files with different name formatting. The only difference is some of the weeks are formatted as decimals such as wk2.1

1.) 220128_B4_Height_performance_wk1

2.) 220204_SQ_Height_rerun_performance_plateD_wk44

3.) 220208_B4_Height_performance_noWeight_wk2.1

All I want is the the first and last parts of the file combined like YYMMDD_wk#. The code I have is giving me a NA for the new column.

performance <- list.file(path ="/Users/Desktop/performance", pattern= "*.xlsx", full.names=T) %>%
    setNames(nm = .) %>%
    lapply(read_excel) %>%
    bind_rows(.id = "Week") %>%
    mutate(Week = as.factor(str_extract(Week, "(\\d )_wk(\\d \\.?\\d*)")))

CodePudding user response:

You can use this code:

performance <- list("220128_B4_Height_performance_wk1", "220204_SQ_Height_rerun_performance_plateD_wk44" ,"220208_B4_Height_performance_noWeight_wk2.1")

First select everything before the first _:

first <- sub("\\_.*", "", performance)

Output:

[1] "220128" "220204" "220208"

Second select everything after the last _:

second <- sub('. _(. )', '\\1', performance)

Output:

[1] "wk1"   "wk44"  "wk2.1"

And finally combine first and second part using this code:

paste(first, second, sep = "_")

Output:

[1] "220128_wk1"   "220204_wk44"  "220208_wk2.1"

CodePudding user response:

Using stringr::str_replace_all():

library(stringr)
# example data with directories and file exts
Week <- c(
  "dir\\220128_B4_Height_performance_wk1.xlsx",
  "dir\\220204_SQ_Height_rerun_performance_plateD_wk44.xlsx",
  "dir\\220208_B4_Height_performance_noWeight_wk2.1.xlsx"
)

str_replace_all(basename(Week), "(\\d*)_.*(_wk\\d*)\\..*", "\\1\\2")
#> [1] "220128_wk1"  "220204_wk44" "220208_wk2"

Created on 2022-03-14 by the reprex package (v2.0.1)

Explanation:

  1. basename() removes the path before the filename
  2. (\\d*)_.* captures everything before the first underscore
  3. (_wk\\d*)\\..* captures the _wk# through the first period
  4. "\\1\\2" retains just those two captured groups.
  • Related