I am trying to pull out rows from a .csv file where a variable matches a certain identifier. Here's an example dataset (myfile.csv):
id,x,y,z
A01,1,5,7
A02,4,4,7
B01,1,6,6
A01,5,7,4
A01,4,8,4
C02,3,1,3
A01,1,2,3
I could use the following:
awk -F',' '{if($1 == "A01") print}' myfile.csv > outfile.csv
or
awk -F',' '{if($1 == "A01") print > "outfile.csv" }' myfile.csv
which will result in outfile.csv:
A01,1,5,7
A01,5,7,4
A01,4,8,4
A01,1,2,3
However, I am dealing with a very large dataset (200 GB), and when running this I have to wait for awk to finish before anything shows up in outfile.csv.
Is there a way for awk to print to the file as soon as it hits a matching row (i.e. the file is updated as awk processes)?
CodePudding user response:
Try running the following command. Instead of redirecting inside awk on every matching line, it does a single output redirection from the shell for the whole awk run. I expect this to be fast enough compared to your current command, though fair warning: I haven't tested it.
awk -F',' '{if($1 == "A01") print}' myfile.csv > "outputfile.csv"
Or, there is no need to spell out the if condition and print: when a condition is true and no action is given, awk prints the line as its default action, so the above can be shortened to:
awk -F',' '($1 == "A01")' myfile.csv > "outputfile.csv"
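As a quick sanity check that the two forms really select the same rows, here is a minimal sketch run against a few rows of the sample data. The temporary directory and the file names long.csv and short.csv are made up for this illustration.

```shell
# Write a few rows of the question's sample data to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'id,x,y,z' 'A01,1,5,7' 'A02,4,4,7' 'B01,1,6,6' 'A01,5,7,4' > "$workdir/myfile.csv"

# Explicit if/print form.
awk -F',' '{if($1 == "A01") print}' "$workdir/myfile.csv" > "$workdir/long.csv"

# Pattern-only form: a true pattern with no action prints the line.
awk -F',' '$1 == "A01"' "$workdir/myfile.csv" > "$workdir/short.csv"

# cmp exits with status 0 only if the two files are byte-for-byte identical.
cmp "$workdir/long.csv" "$workdir/short.csv" && echo "identical output"
```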
CodePudding user response:
Like most tools, awk buffers its output for efficiency, so just tell it to flush its buffer after every print:
awk -F',' '$1 == "A01"{ print; fflush() }' myfile.csv > outfile.csv
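On a 200 GB input, flushing after every single matching line may cost some throughput. A possible middle ground, sketched below, is to flush every n matches instead. This assumes an awk that supports fflush() (gawk, mawk, and BSD awk do); the variable names n and count, the demo value n=2, and the temporary directory are choices made for this example, not part of the original answer.

```shell
# Sample data from the question, written to a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/myfile.csv" <<'EOF'
id,x,y,z
A01,1,5,7
A02,4,4,7
B01,1,6,6
A01,5,7,4
A01,4,8,4
C02,3,1,3
A01,1,2,3
EOF

# Print each matching row, but only flush once every n matches.
# n=2 is just for the demo; something like 1000 would be more typical.
awk -F',' -v n=2 '$1 == "A01" { print; if (++count % n == 0) fflush() }' \
    "$tmpdir/myfile.csv" > "$tmpdir/outfile.csv"

cat "$tmpdir/outfile.csv"
```

Anything still sitting in the buffer is written out when awk exits, so no rows are lost. While a long job runs, `tail -f outfile.csv` in another terminal shows the rows as they arrive.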