I have many csv files that look like this:
data/0.Raw/20190401_data.csv
(Only the date in the middle of the filename changes)
I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.
I wrote a bash script that adds the full path and filename as a column in each file, and then merges into a master csv. However, I am having trouble getting rid of the path and the extension to only keep the date portion
The bash script
#! /bin/bash
mkdir data/1.merged
for i in "data/0.Raw/"*.csv; do
awk -F, -v OFS=, 'NR==1{sub(/\_data.csv$/, "", FILENAME) } NR>1{ $1=FILENAME }1' "$i" |
column -t > "data/1.merged/"${i/"data/0.Raw/"/""}""
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files
rm data/1.merged/*.csv
mv data/1.merged/all_files data/1.merged/all_files.csv
using "sub" I was able to remove the "_data.csv" part, but as a result the column gets added as "data/0.Raw/20190401" - that is, I am having trouble removing both the part before the date as well as the part after the date.
I tried replacing sub with gensub to regex match everything except the 8 digits in the middle but that does not seem to work either.
Any ideas on how to solve this?
Thanks!
CodePudding user response:
You can process and concatenate all the files with a single awk
call:
awk '
FNR == 1 {
date = FILENAME
gsub(/.*\/|_data\.csv$/,"",date)
next
}
{ print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv