I'm trying to build a airflow DAG and need to split out 7 tables contained in one csv into seven separate csv's.
dataset1
header_a | header_b | header_c |
---|---|---|
One | Two | Three |
One | Two | Three |
<-Always two spaced rows between data sets
dataset N
<-part of csv file giving details on data
header_d | header_e | header_f | header_g |
---|---|---|---|
One | Two | Three | Four |
One | Two | Three | Four |
out
:
dataset1.csv
datasetn.csv
Based on my research i think my solution might lie in awk
searching for the double spaces?
EDIT: In plain text as requested.
table1 details1,
table1 details2,
table1 details3,
header_a,header_b,header_c,
1,2,3
1,2,3
tableN details1,
tableN details2,
tableN details3,
header_a, header_b,header_c,header_N,
1,2,3,4
1,2,3,4
CodePudding user response:
Always two spaced rows between data sets
If your CSV file contains blank lines, and your goal is to write out each chunk of records that is separated by those blank lines into individual files, then you could use awk
with its record separator RS
set to nothing, which then defaults to treating each "paragraph" as a record. Each of them can then be redirected to a file whose name is based on the record number NR
:
awk -vRS= '{print $0 > ("output_" NR ".csv")}' input.csv
This reads from input.csv
and writes the chunks to output_1.csv
, output_2.csv
, output_3.csv
and so forth.
If my interpretation of your input file's structure (or your problem in general) is wrong, please provide more detail to clarify.