Home > Mobile >  BASH Split CSV Into Multiple Files Based on Column Value
BASH Split CSV Into Multiple Files Based on Column Value

Time:09-16

I have a file named fulldata.tmp which contains pipe delimited data (I can change it to comma if needed but generally like using pipe). With a BASH Shell script I would like to split lines out to new files based on the value in column 1 and retain the header. I'm pulling this data via SQL so I can pre-sort if needed but I don't have direct access to the terminal running this script so development and debugging is difficult. I've searched dozens of examples mostly recommending awk but I'm not connecting the dots. This is my core need and below are a couple quality of life options I'd like if it's easy along with example data.

Nice if possible: I would like to specify which columns print to the new files (my example desired output shows I want columns 1-4 out of the initial 5 columns).

Nice if possible: I would like the new files named with a prefix then the data that is being split on followed by extension: final_$col1.csv

GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future

final_ABC.csv

ABC|001|2022-09-15|True

final_DEF.csv

DEF|001|2022-09-16|False

final_GHI.csv

GHI|002|2022-10-17|True

CodePudding user response:

Maybe awk

awk -F'|' -v OFS='|' 'NR>1{print $1, $2, $3, $4 > "final_"$1".csv"}' fulldata.tmp

Check the created csv files and it's content.

tail -n 1 final*.csv 

Output

==> final_ABC.csv <==
ABC|001|2022-09-15|True

==> final_DEF.csv <==
DEF|001|2022-09-16|False

==> final_GHI.csv <==
GHI|002|2022-10-17|True

CodePudding user response:

rquery (https://github.com/fuyuncat/rquery/releases) can do this straightforward

[ rquery]$ cat samples/file1.csv
GROUPID|LABEL|DATE|ACTIVE|COMMENT
ABC|001|2022-09-15|True|None
DEF|001|2022-09-16|False|None
GHI|002|2022-10-17|True|Future
[ rquery]$ ./rq -n -i "|" -q "p d/\|/ | s @1,@2,@3,@4 | >'/tmp/file_' @1 '.csv'" samples/file1.csv
[ rquery]$ cat /tmp/file_ABC.csv
ABC|001|2022-09-15|True
[ rquery]$ cat /tmp/file_DEF.csv
DEF|001|2022-09-16|False
[ rquery]$ cat /tmp/file_GHI.csv
GHI|002|2022-10-17|True

CodePudding user response:

This works and preserves the header which I believe was a requirement.

cut -d '|' -f 1 fulldata.tmp | grep -v GROUPID | sort -u | while read -r id; do grep -E "^${id}|^GROUPID" fulldata.tmp > final_${id}.csv; done

I think a pure awk solution is better though.

  • Related