I'm trying to use command-line tools to edit some CSV files that I have in the following layout, spread over several year folders:
- dataset
  - year_1 (e.g. 1929)
    - csv_filename_1.csv
    - csv_filename_2.csv
    - csv_filename_3.csv
    - ...
  - year_2
    - ...
  - year_1 (e.g. 1929)
I'm trying to append each file's name to its content by creating a new column called filename, with a value such as ./year_1/csv_filename_1.csv added to every row. After that, I would gzip the file.
Because of the number of year folders (almost 100) and the number of CSVs in them (about 100k in total), I plan to use GNU parallel to run it. I was trying to use sed, doing something like
fname="1929/csv_filename_1.csv" && \ # to simulate parallel's parameterization
sed -E -e '1s/$/,filename/' \ # append ",filename" to CSV header
-e '2,\$s/$/,${fname}/' ${fname} \ # append the filename string to the content
But I can't get sed to work with the second expression: I either get "${fname}" written as-is to the file, or the sed error "sed: -e expression #1, char 6: unknown command: '\'", complaining about a comma or the slash. I have also tried grouping the expressions, like -e '1{s/$/,filename/};2,\${s/$/,${fname}/}', to no avail.
For now I have given up on sed and started trying awk, but not knowing why it didn't work is bothering me, so I came to ask why, and how to make it work.
Just one more piece of info regarding how I intend to run this thing. It would be something like
find ~/dataset -iname "*csv" -print0 | parallel -0 -j0 '<the whole command here (sed gz)>'
How could I do this? What am I forgetting? Thanks, folks!
PS: I just got it working with awk:
awk -v d="csv_filename_1.csv" -F"," 'FNR==1{a="filename"} FNR>1{a=d} {print $0","a}' csv_filename_1.csv | less
User response:
This might work for you (GNU parallel and sed):
find . -type f -name '*.csv' | parallel sed -i \''1s/$/,filename/;1!s#$#,{}#'\' {}
Use find to deliver each filename to the parallel command. Use sed to append ,filename to the header line of each file, and the file name held in {} to every remaining line of that file.
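As a rough illustration of what that sed script does to a single file, here is a sketch on made-up data. The temporary directory, sample.csv and its contents are hypothetical, and parallel is left out so that the effect of sed alone is visible; the fname variable stands in for what parallel would substitute for {}.

```shell
# Hypothetical sample data, created in a temporary directory for illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'a,b\n1,2\n3,4\n' > "$workdir/1929/sample.csv"
cd "$workdir" || exit 1

fname=./1929/sample.csv   # stands in for the {} that parallel would fill in

# Line 1 gets ",filename"; every other line (1!) gets "," plus the path.
# '#' is the delimiter of the second s command because the path contains '/'.
result=$(sed '1s/$/,filename/;1!s#$#,'"$fname"'#' "$fname")
printf '%s\n' "$result"
```

On this sample the header becomes a,b,filename and each data row gains ,./1929/sample.csv at the end. Without -i, sed only prints the result and leaves the file unchanged.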
N.B. The second sed command uses alternative delimiters, s#...#...#, so that the slashes in the filename do not clash with the delimiter. Also, find should be run from inside the dataset directory.
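As for why the attempt in the question fails, three things seem to be involved. Inside single quotes the shell expands nothing, so sed receives ${fname} literally. The backslash in \$ also reaches sed unchanged; it is only needed inside double quotes, to keep the shell away from the $ address. And the slash in the filename collides with the / delimiter of s. A minimal sketch of one way around all three, assuming a hypothetical sample file and the fname variable from the question, with the gzip step added:

```shell
# Hypothetical sample file standing in for one of the real CSVs.
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'a,b\n1,2\n' > "$workdir/1929/csv_filename_1.csv"
cd "$workdir" || exit 1

fname="1929/csv_filename_1.csv"

# First expression: single quotes, nothing for the shell to expand.
# Second expression: double quotes so ${fname} is expanded; \$ protects the
# last-line address and the end-of-line anchor from the shell, and '#'
# replaces '/' as the delimiter so the path's slash is harmless.
sed -e '1s/$/,filename/' -e "2,\$s#\$#,${fname}#" "$fname" | gzip > "$fname.gz"

# Show the compressed result.
gzip -dc "$fname.gz"
```

This assumes that no path contains #, & or a backslash, since all three have a special meaning in the s command; if such names can occur, the awk approach from the question's PS is probably the safer route.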