Filtering a file with values over 0.70 using AWK

Time:09-27

I have a file of targets predicted by Diana, and I would like to extract those with values over 0.70:

>AAGACAACGUUUAAACCA|ENST00000367816|0.999999999975474
UTR3    693-701 0.00499294596715397
UTR3    1045-1053   0.405016433077734
>AAGACAACGUUUAAACCA|ENST00000392971|0.996695852735028
CDS     87-95   0.0112208345874892

I don't know why this script doesn't work, since it seems to be correct:

for file in SC*
do
  grep ">" $file | awk 'BEGIN{FS="|"}{if($3 >= 0.70)}{print $2, $3}' > 70/$file.tab
  
done

The problem is that it doesn't filter. Can you help me find the error?

CodePudding user response:

For a start, that's not a valid awk script since you have a misplaced } character:

BEGIN{FS="|"}{if($3 >= 0.70)}{print $2, $3}
#                           |
#                            ------------- 
#                              move here  |
#                                         V
BEGIN{FS="|"}{if($3 >= 0.70){print $2, $3}}

You also don't need grep because awk can do that itself, and you can also set the field separator without a BEGIN block. For example, here's a command that will output field 3 values greater than 0.997, on lines starting with > (using | as a field separator):

pax> awk -F\| '/^>/ && $3 > 0.997 { print $3 }' prog.in
0.999999999975474

I chose 0.997 to ensure one of the lines in your input file was filtered out for being too low (as proof that it works). For your desired behaviour, the command would be:

pax> awk -F\| '/^>/ && $3 > 0.7 { print $2, $3 }' prog.in
ENST00000367816 0.999999999975474
ENST00000392971 0.996695852735028

Keep in mind I've used > 0.7 as per your "values over 0.70" in the heading and text of your question. If you really mean "values 0.70 and above" as per the code in your question, simply change > into >=.
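
If you want to check the difference between the two operators yourself, here is a minimal sketch. The sample records are assumptions made up for illustration (the IDs ENST00000000001 and ENST00000000002 are hypothetical, and one of them scores exactly 0.70); they are not from your data.

```shell
# Sample input: the third line (a made-up ID) scores exactly 0.70.
sample='>AAGACAACGUUUAAACCA|ENST00000367816|0.999999999975474
UTR3    693-701 0.00499294596715397
>AAGACAACGUUUAAACCA|ENST00000000001|0.70
>AAGACAACGUUUAAACCA|ENST00000000002|0.25'

# Strictly greater than 0.7: the 0.70 record is dropped.
over=$(printf '%s\n' "$sample" | awk -F'|' '/^>/ && $3 > 0.7 { print $2, $3 }')

# Greater than or equal to 0.7: the 0.70 record is kept.
at_least=$(printf '%s\n' "$sample" | awk -F'|' '/^>/ && $3 >= 0.7 { print $2, $3 }')

printf 'over:\n%s\n\nat least:\n%s\n' "$over" "$at_least"
```

Only the boundary record changes between the two runs, so which operator you pick matters only if your data can contain a score of exactly 0.70.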

CodePudding user response:

It looks like you are running a for loop that starts a new awk process for every file. You don't need to do that: awk can read all of the matching files by itself. So, apart from fixing the typo in your awk program, pass all of the files to a single awk invocation, like this:

awk -F\| 'FNR==1{close(out); out="70/"FILENAME".tab"} /^>/ && $3 > 0.7 { print $2,$3 > out }' SC*
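
To try the single-process approach safely, here is a small sketch that runs it in a throwaway directory. The file names SC_a and SC_b, the shortened SEQ headers, and the ENST00000000001 record are assumptions for illustration only.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir 70

# Two hypothetical input files matching the SC* glob.
printf '%s\n' '>SEQ|ENST00000367816|0.999999999975474' \
              'UTR3    693-701 0.0049' \
              '>SEQ|ENST00000000001|0.31' > SC_a
printf '%s\n' '>SEQ|ENST00000392971|0.996695852735028' \
              'CDS     87-95   0.0112' > SC_b

# One awk process handles every file; FNR==1 is true on the first line
# of each input file, so that is where the output name is switched.
awk -F'|' 'FNR==1 { close(out); out = "70/" FILENAME ".tab" }
           /^>/ && $3 > 0.7 { print $2, $3 > out }' SC*

# Show what ended up in each output file.
cat 70/SC_a.tab 70/SC_b.tab
```

One difference from the original loop: an input file with no matching lines produces no .tab file at all here, whereas the loop's shell redirection would create an empty one.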

CodePudding user response:

I think it may be safer to filter with a regex in string mode instead of numerically:

  • $3 !~/0[.][0-6]/

If awk interprets the input as a number and does a numeric comparison, the result is subject to the rounding errors inherent in floating-point math. With a string-based filter, you avoid values above

~0.69999999999999995559107901… (approximately the IEEE 754 double-precision value of 7E-1)

being rounded up to 0.7.
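
As a rough illustration of that rounding effect, the sketch below compares the two approaches on one hypothetical record whose score is below 0.7 in decimal but too close to it for a double to tell apart. The record and the `^` anchor in the regex are my own assumptions, and the result assumes an awk that stores numbers as IEEE 754 doubles, which is the usual case.

```shell
# Hypothetical score: less than 0.7 as written, but nearest double is the same as for 0.7.
line='>SEQ|ENST00000000001|0.69999999999999996'

# Numeric comparison: the field is converted to a double before comparing.
numeric=$(printf '%s\n' "$line" | awk -F'|' '$3 >= 0.7 { print "kept" }')

# String-based filter: reject any score whose text starts with 0.0 through 0.6.
textual=$(printf '%s\n' "$line" | awk -F'|' '$3 !~ /^0[.][0-6]/ { print "kept" }')

printf 'numeric: [%s]\ntextual: [%s]\n' "$numeric" "$textual"
```

With a double-based awk, the numeric test keeps the record and the textual one rejects it. Note that the textual filter assumes scores are written as plain decimals; a value in exponent notation such as 5e-05 would slip through it.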
