Adding part of filename as column to csv files, then concatenate-CodePudding

I have many csv files that look like this:

data/0.Raw/20190401_data.csv

(Only the date in the middle of the filename changes)

I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.

I wrote a bash script that adds the full path and filename as a column in each file, and then merges into a master csv. However, I am having trouble getting rid of the path and the extension to only keep the date portion

The bash script

#! /bin/bash

mkdir data/1.merged
for i in "data/0.Raw/"*.csv; do
  awk -F, -v OFS=, 'NR==1{sub(/\_data.csv$/, "", FILENAME) } NR>1{ $1=FILENAME }1' "$i" |
    column -t > "data/1.merged/"${i/"data/0.Raw/"/""}""
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files

rm data/1.merged/*.csv 
mv data/1.merged/all_files data/1.merged/all_files.csv

using "sub" I was able to remove the "_data.csv" part, but as a result the column gets added as "data/0.Raw/20190401" - that is, I am having trouble removing both the part before the date as well as the part after the date.

I tried replacing sub with gensub to regex match everything except the 8 digits in the middle but that does not seem to work either.

Any ideas on how to solve this?

Thanks!

CodePudding user response：

You can process and concatenate all the files with a single awk call:

awk '
    FNR == 1 {
        date = FILENAME
        gsub(/.*\/|_data\.csv$/,"",date)
        next
    }
    { print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv