Pull rows in awk to file line-by-line

I am trying to pull out rows from a .csv file where a column matches a certain identifier. Here's an example dataset (myfile.csv):

id,x,y,z
A01,1,5,7
A02,4,4,7
B01,1,6,6
A01,5,7,4
A01,4,8,4
C02,3,1,3
A01,1,2,3

I could use either of the following:

awk -F',' '{if($1 == "A01") print}' myfile.csv > outfile.csv

awk -F',' '{if($1 == "A01") print > "outfile.csv" }' myfile.csv

which will result in outfile.csv:

A01,1,5,7
A01,5,7,4
A01,4,8,4
A01,1,2,3

However, I am dealing with a very large dataset (200 GB), and when running this I have to wait for awk to finish before anything appears in outfile.csv.

Is there a way for awk to print to the file as soon as it hits a matching row (i.e. so that the file is updated while awk is still processing)?

CodePudding user response：

Try running the following command. Instead of redirecting inside the awk program every time the condition matches, it redirects the output once, from the shell, to the output file. I am fairly sure this should be fast enough compared to your current command, but fair warning: I haven't tested it.

awk -F',' '{if($1 == "A01") print}' myfile.csv > "outputfile.csv"

Alternatively, there is no need to spell out the if condition and the print: when a condition is true and no action is given, awk prints the line by default, so the above can be shortened to the following:

awk -F',' '($1 == "A01")' myfile.csv > "outputfile.csv"
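
As a quick sanity check, here is a small sketch that runs both forms on the sample rows from the question and compares the results. The file names explicit.csv and short.csv are made up for illustration, and the sample file simply stands in for the real data.

```shell
# Recreate the sample dataset from the question.
printf '%s\n' 'id,x,y,z' 'A01,1,5,7' 'A02,4,4,7' 'B01,1,6,6' \
    'A01,5,7,4' 'A01,4,8,4' 'C02,3,1,3' 'A01,1,2,3' > myfile.csv

# Explicit form: if-condition plus print.
awk -F',' '{ if ($1 == "A01") print }' myfile.csv > explicit.csv

# Short form: a bare pattern, relying on awk's default print action.
awk -F',' '$1 == "A01"' myfile.csv > short.csv

# cmp exits with status 0 only when the two files are byte-for-byte equal.
cmp explicit.csv short.csv && echo "identical"
```

Both files should contain the same four A01 rows, so the choice between the two forms is purely a matter of style.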

CodePudding user response：

Like most tools, awk buffers its output for efficiency, so just tell it to flush its buffer after every print:

awk -F',' '$1 == "A01"{ print; fflush() }' myfile.csv > outfile.csv
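
For completeness, a minimal sketch of the same idea for both redirection styles from the question, run on the sample rows. The second output name (outfile2.csv) and the awk variable `out` are assumptions added for illustration only.

```shell
# Sample rows from the question, standing in for the 200 GB file.
printf '%s\n' 'id,x,y,z' 'A01,1,5,7' 'A02,4,4,7' 'B01,1,6,6' \
    'A01,5,7,4' 'A01,4,8,4' 'C02,3,1,3' 'A01,1,2,3' > myfile.csv

# Variant 1: shell redirection; fflush() pushes each matching row out at once.
awk -F',' '$1 == "A01" { print; fflush() }' myfile.csv > outfile.csv

# Variant 2: awk-side redirection; flush that specific output file by name.
awk -F',' -v out="outfile2.csv" \
    '$1 == "A01" { print > out; fflush(out) }' myfile.csv

cat outfile.csv
```

While the real job is running, `tail -f outfile.csv` in a second terminal should show rows arriving as they are matched. Flushing after every row gives up some of the throughput that buffering provides, so this trades a bit of speed for immediacy.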