gnu parallel sed to edit both csv header and contents


I'm trying to use command-line tools to edit some CSV files that are laid out as follows across several year folders:

  • dataset
    • year_1 (e.g. 1929)
      • csv_filename_1.csv
      • csv_filename_2.csv
      • csv_filename_3.csv
      • ...
    • year_2
      • ...

I'm trying to append each file's name to its contents: add a new column called filename to the header, and put the file's path (e.g. ./year_1/csv_filename_1.csv) in that column on every row. After that, I would gzip the file.

Due to the number of year folders (almost 100) and the number of CSVs in them (about 100k in total), I plan to use GNU parallel to run it. I was trying to use sed, doing something like

fname="1929/csv_filename_1.csv" &&          \ # to simulate parallel's parameterization
    sed -E -e '1s/$/,filename/'             \ # append ",filename" to CSV header
           -e '2,\$s/$/,${fname}/' ${fname} \ # append the filename string to the content

But I can't get the second sed expression to work: either "${fname}" is written as-is to the file, or sed fails with "sed: -e expression #1, char 6: unknown command: '\'", apparently complaining about a comma or the slash. I have also tried grouping the expressions, like -e '1{s/$/,filename/};2,\${s/$/,${fname}/}', to no avail.

For now I have given up on sed and started trying awk, but not knowing why it didn't work is bothering me, so I came to ask why it fails and how to make it work.

Just one more piece of info regarding how I intend to run this thing. It would be something like

find ~/dataset -iname "*csv" -print0 | parallel -0 -j0 '<the whole command here (sed + gz)>'

How could I do this? What am I forgetting? Thanks, folks!

PS: I just got it working with awk:

awk -v d="csv_filename_1.csv" -F"," 'FNR==1{a="filename"} FNR>1{a=d} {print $0","a}' csv_filename_1.csv | less
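The awk one-liner above can be combined with the gzip step mentioned earlier. What follows is only a minimal sketch for a single file, using throwaway sample data; the helper name add_filename_and_gzip and the sample CSV contents are made up for illustration. It writes a compressed copy next to the original and leaves the original untouched.

```shell
#!/bin/sh
# Hypothetical sample data: a scratch directory holding one small CSV.
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'id,value\n1,a\n2,b\n' > "$workdir/1929/csv_filename_1.csv"
cd "$workdir" || exit 1

# Hypothetical helper: append a "filename" column to the CSV given as $1
# and write the result, gzip-compressed, to "$1.gz".
add_filename_and_gzip() {
    # Line 1 gets the column name; every later line gets the path itself.
    awk -v d="$1" 'FNR==1{a="filename"} FNR>1{a=d} {print $0 "," a}' "$1" |
        gzip > "$1.gz"
}

add_filename_and_gzip "./1929/csv_filename_1.csv"

# Decompress to check what was written.
unpacked=$(gzip -dc "./1929/csv_filename_1.csv.gz")
printf '%s\n' "$unpacked"
```

This per-file step is the part that would be handed to parallel. Note that awk interprets backslash escapes in values passed with -v, so paths containing backslashes would need different handling.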

Answer:

This might work for you (GNU parallel and sed):

find . -type f -name '*.csv' | parallel sed -i \''1s/$/,filename/;1!s#$#,{}#'\' {}

Use find to deliver the filename to the parallel command.

Use sed to append ,filename to the header line of each file, and to append the file name that parallel substitutes for {} to every other line.

N.B. The second sed command uses alternative delimiters (s#...#...#) so that the slashes in the file name do not clash with the delimiter. Also, find should be run from the dataset directory.
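As for why the attempt in the question fails: inside single quotes the shell does not expand ${fname}, and \$ reaches sed with its backslash intact, so sed no longer reads it as the plain $ last-line address. Once the variable is expanded, the slashes in the path would also clash with the s/.../.../ delimiter. Below is a minimal sketch of the single-file case without parallel, run on throwaway sample data (the CSV contents are made up for illustration).

```shell
#!/bin/sh
# Hypothetical sample data: a scratch directory holding one small CSV.
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'id,value\n1,a\n2,b\n' > "$workdir/1929/csv_filename_1.csv"
cd "$workdir" || exit 1

fname="1929/csv_filename_1.csv"

# First expression: single quotes, so sed receives $ as the end-of-line anchor.
# Second expression: double quotes, so the shell expands ${fname}; each $ meant
# for sed is escaped as \$ and arrives at sed as a bare $.
# '|' is used as the s-command delimiter so the '/' in the path is harmless.
result=$(sed -e '1s/$/,filename/' -e "2,\$s|\$|,${fname}|" "$fname")
printf '%s\n' "$result"
```

This assumes the path contains no |, & or backslash, since those are special in the replacement part of the s command. Adding -i to sed would edit the file in place instead of printing the result.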
