Home > Back-end >  Extracting filename string within a loop in R
Extracting filename string within a loop in R

Time:08-18

I am trying to write a code to create a new master dataframe (master_adh_merge) which merges all source .csv files within a folder ROOT.DIR.MERGES.

Source filename structure is adh-merged_YYYY-MM-DD.csv
eg. adh-merged_2021-01-01.csv, adh-merged_2021-02-01.csv ... adh-merged_2022-08-01.csv

The data within a source file would be structured as follows and refers to a single month.

Patient_ID Site Adherence
A_1 Ax 30
A_2 Ax 0
B_1 Bx 20

I want to strip the YYYY_MM_DD string from the each source filename, and paste this string into a new column for every row of data taken from that source file to denote the month from which each row is taken.

I would like master_adh_merge to look like this

Patient_ID Site Adherence Month
A_1 Ax 30 2021-01-01
A_2 Ax 0 2021-01-01
B_1 Bx 20 2021-01-01
A_1 Ax 15 2021-02-01
A_2 Ax 10 2021-02-01
B_1 Bx 40 2021-02-01

Each month I will have a new adh-merged_YYYY_MM_DD.csv source file so I am trying to write a loop which I can run each month to keep updating master_adh_merge.

This is the current code

ROOT.DIR <- "/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts"
ROOT.DIR.MERGES <- paste0(ROOT.DIR, "/monthly-merges")
files <- basename(list.files(ROOT.DIR.MERGES, all.files = TRUE, recursive=TRUE))

master_adh_merge <- vector()

for(i in files){
  date_string = str_match(i, "([0-9]{4}-[0-9]{2}-[0-9]{2}).csv$")[,2]
  print(i)
  df <- read.csv(i)
  master_adh_merge <- rbind (master_adh_merge,df)
}

master_adh_merge <- data.frame(master_adh_merge)

When running this I get the error

Warning: cannot open file 'adh-merged_2021-01-01.csv': No such file or directoryError in file(file, "rt") : cannot open the connection

I have checked and I am fully connected to the network path for ROOT.DIR.MERGES

I have run this code to check that I can actually access the files using the full ROOT.DIR.MERGES path and I can.

adh0 <- read.csv("/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges/adh-merged_2021-01-01.csv")

When I try to do exact step using ROOT.DIR.MERGES, however I get an error.

adh1 <- read.csv(ROOT.DIR.MERGES,"adh-merged_2021-01-01.csv")

Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: In file(file, "rt") :
  'raw = FALSE' but '/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges' is not a regular file
2: In file(file, "rt") :
  cannot open file '/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges': it is a directory
> 

Firstly, should my loop (in its current format) allow me to achieve what I am wanting to?

Secondly, from the error messages, it seems like how I am using ROOT.DIR.MERGES is incorrect, but I am not sure why.

CodePudding user response:

If files looks something like this:

[1] "adh-merged_2021-01-01.csv" "adh-merged_2021-01-02.csv" "adh-merged_2021-01-03.csv"

then, you can extract the dates from the files vector, and use do.call(), with rbind and lapply(), like this:

dates = str_extract(files, "[0-9]{4}-[0-9]{2}-[0-9]{2}(?=.csv$)")

do.call(
  rbind, 
  lapply(dates, \(d) {
    data = read.csv(paste0(ROOT.DIR.MERGES, "/adh-merged_", d, ".csv"))
    data$month = d
    data

  })
)
  • Related