Adding part of filename as column to csv files, then concatenate

Time: 07-26

I have many csv files that look like this:

data/0.Raw/20190401_data.csv

(Only the date in the middle of the filename changes)

I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.

I wrote a bash script that adds the full path and filename as a column in each file and then merges everything into a master csv. However, I am having trouble getting rid of the path and the extension so that only the date portion is kept.

The bash script:

#!/bin/bash

mkdir data/1.merged
for i in data/0.Raw/*.csv; do
  awk -F, -v OFS=, 'NR==1{ sub(/_data\.csv$/, "", FILENAME) } NR>1{ $1=FILENAME }1' "$i" |
    column -t > "data/1.merged/${i##*/}"
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files

rm data/1.merged/*.csv
mv data/1.merged/all_files data/1.merged/all_files.csv

Using "sub" I was able to remove the "_data.csv" part, but the column still gets added as "data/0.Raw/20190401" - that is, I can strip the suffix after the date, but not the path in front of it, in the same expression.
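(As a point of comparison, both pieces can be stripped in the shell itself with parameter expansion before awk ever runs - a sketch using the filename pattern above:)

```shell
# Sketch: strip the directory and the suffix with bash parameter expansion,
# using the data/0.Raw/20190401_data.csv naming from above.
i="data/0.Raw/20190401_data.csv"
f=${i##*/}            # drop everything up to the last "/"  -> 20190401_data.csv
date=${f%_data.csv}   # drop the trailing "_data.csv"       -> 20190401
echo "$date"          # prints 20190401
```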

I tried replacing sub with gensub to regex-match everything except the eight digits in the middle, but that does not seem to work either.

Any ideas on how to solve this?

Thanks!

CodePudding user response:

You can process and concatenate all the files with a single awk call:

awk '
    FNR == 1 {                               # first line of each input file
        date = FILENAME
        gsub(/.*\/|_data\.csv$/, "", date)   # strip the path and the suffix
        next                                 # skip the header line
    }
    { print date "," $0 }                    # prepend the date to every row
' data/0.Raw/*_data.csv > all_files.csv
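Note that this also discards every header line. If the merged file should keep one header row, a small variant (a sketch, reusing the first file's header with a "date" column name prefixed - that column name is an assumption) is:

```shell
# Sketch: same technique as above, but the first file's header row is kept
# once, prefixed with a "date" column (the column name is an assumption).
awk '
    FNR == 1 {
        date = FILENAME
        gsub(/.*\/|_data\.csv$/, "", date)      # keep only the date portion
        if (!seen_header++) print "date," $0    # emit one header row
        next
    }
    { print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv
```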