I am trying to write a code to create a new master dataframe (master_adh_merge
) which merges all source .csv files within a folder ROOT.DIR.MERGES
.
Source filename structure is adh-merged_YYYY-MM-DD.csv
eg. adh-merged_2021-01-01.csv, adh-merged_2021-02-01.csv ... adh-merged_2022-08-01.csv
The data within a source file would be structured as follows and refers to a single month.
Patient_ID | Site | Adherence |
---|---|---|
A_1 | Ax | 30 |
A_2 | Ax | 0 |
B_1 | Bx | 20 |
I want to strip the YYYY_MM_DD string from the each source filename, and paste this string into a new column for every row of data taken from that source file to denote the month from which each row is taken.
I would like master_adh_merge
to look like this
Patient_ID | Site | Adherence | Month |
---|---|---|---|
A_1 | Ax | 30 | 2021-01-01 |
A_2 | Ax | 0 | 2021-01-01 |
B_1 | Bx | 20 | 2021-01-01 |
A_1 | Ax | 15 | 2021-02-01 |
A_2 | Ax | 10 | 2021-02-01 |
B_1 | Bx | 40 | 2021-02-01 |
Each month I will have a new adh-merged_YYYY_MM_DD.csv source file so I am trying to write a loop which I can run each month to keep updating master_adh_merge
.
This is the current code
ROOT.DIR <- "/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts"
ROOT.DIR.MERGES <- paste0(ROOT.DIR, "/monthly-merges")
files <- basename(list.files(ROOT.DIR.MERGES, all.files = TRUE, recursive=TRUE))
master_adh_merge <- vector()
for(i in files){
date_string = str_match(i, "([0-9]{4}-[0-9]{2}-[0-9]{2}).csv$")[,2]
print(i)
df <- read.csv(i)
master_adh_merge <- rbind (master_adh_merge,df)
}
master_adh_merge <- data.frame(master_adh_merge)
When running this I get the error
Warning: cannot open file 'adh-merged_2021-01-01.csv': No such file or directoryError in file(file, "rt") : cannot open the connection
I have checked and I am fully connected to the network path for ROOT.DIR.MERGES
I have run this code to check that I can actually access the files using the full ROOT.DIR.MERGES
path and I can.
adh0 <- read.csv("/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges/adh-merged_2021-01-01.csv")
When I try to do exact step using ROOT.DIR.MERGES
, however I get an error.
adh1 <- read.csv(ROOT.DIR.MERGES,"adh-merged_2021-01-01.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning messages:
1: In file(file, "rt") :
'raw = FALSE' but '/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges' is not a regular file
2: In file(file, "rt") :
cannot open file '/Volumes/shared/HAR_SP/SP/STU_cmp22rds/CFHH-extracts/monthly-merges': it is a directory
>
Firstly, should my loop (in its current format) allow me to achieve what I am wanting to?
Secondly, from the error messages, it seems like how I am using ROOT.DIR.MERGES
is incorrect, but I am not sure why.
CodePudding user response:
If files
looks something like this:
[1] "adh-merged_2021-01-01.csv" "adh-merged_2021-01-02.csv" "adh-merged_2021-01-03.csv"
then, you can extract the dates from the files
vector, and use do.call()
, with rbind
and lapply()
, like this:
dates = str_extract(files, "[0-9]{4}-[0-9]{2}-[0-9]{2}(?=.csv$)")
do.call(
rbind,
lapply(dates, \(d) {
data = read.csv(paste0(ROOT.DIR.MERGES, "/adh-merged_", d, ".csv"))
data$month = d
data
})
)