Pull rows in awk to file line-by-line

Time:12-17

I am trying to pull out the rows of a .csv file where a column matches a certain identifier. Here's an example dataset (myfile.csv):

id,x,y,z
A01,1,5,7
A02,4,4,7
B01,1,6,6
A01,5,7,4
A01,4,8,4
C02,3,1,3
A01,1,2,3

I could use the following:

awk -F',' '{if($1 == "A01") print}' myfile.csv > outfile.csv

or

awk -F',' '{if($1 == "A01") print > "outfile.csv" }' myfile.csv

which will result in outfile.csv:

A01,1,5,7
A01,5,7,4
A01,4,8,4
A01,1,2,3

However, I am dealing with a very large dataset (200 GB), and when running the command I have to wait for awk to finish before anything is written to outfile.csv.

Is there a way for awk to print to the file at the moment it hits a matching row (i.e. so that the file is updated as awk processes the input)?

CodePudding user response:

Try running the following command. Instead of redirecting inside the awk program for each matching line, it redirects the output once, in the shell, for the whole awk run. I am fairly sure this will be faster than your current command, though fair warning: I haven't tested it.

awk -F',' '{if($1 == "A01") print}' myfile.csv > "outputfile.csv"

Or, since there is no need to spell out the if condition and the print: when a condition is true and no action is given, awk prints the line by default, so the above can be shortened to the following:

awk -F',' '($1 == "A01")' myfile.csv > "outputfile.csv"
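
As a quick check of the default-action rule, here is a minimal sketch that runs both forms on a few rows of the sample data and compares the results. The file names sample_rows.csv, explicit.csv and implicit.csv are hypothetical scratch files, not part of the question.

```shell
# Hypothetical scratch file holding a few rows from the question's sample data.
printf '%s\n' 'id,x,y,z' 'A01,1,5,7' 'A02,4,4,7' 'A01,5,7,4' > sample_rows.csv

# Explicit form: an if statement plus print inside an action block.
awk -F',' '{ if ($1 == "A01") print }' sample_rows.csv > explicit.csv

# Pattern-only form: a true pattern with no action prints the record.
awk -F',' '$1 == "A01"' sample_rows.csv > implicit.csv

# cmp -s exits 0 when the two files are byte-for-byte identical.
if cmp -s explicit.csv implicit.csv; then echo "same"; else echo "different"; fi
```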

CodePudding user response:

Like most tools, awk buffers its output for efficiency, so just tell it to flush its buffer after every print:

awk -F',' '$1 == "A01"{ print; fflush() }' myfile.csv > outfile.csv
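
If you prefer the variant from the question that redirects inside awk, the same idea should carry over by passing the file name to fflush(). The following is only a sketch on the sample data; sample.csv and sample_out.csv are hypothetical names chosen so that your real files are not overwritten.

```shell
# Recreate the question's sample data in a hypothetical scratch file.
printf '%s\n' 'id,x,y,z' 'A01,1,5,7' 'A02,4,4,7' 'B01,1,6,6' \
  'A01,5,7,4' 'A01,4,8,4' 'C02,3,1,3' 'A01,1,2,3' > sample.csv

# Redirect inside awk and flush that specific file after each matching row,
# so the file grows while awk is still reading its input.
awk -F',' '$1 == "A01" { print > "sample_out.csv"; fflush("sample_out.csv") }' sample.csv

# Show the result.
cat sample_out.csv
```

On a long-running job, something like tail -f on the output file in a second terminal should show rows arriving as they are matched. Flushing after every print costs some throughput, so it may be worth doing only if you really need to see the rows immediately.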