Processing data from a large number of input files


My AWK script processes each log file from the folder "${results}": it looks for a pattern (the number on the first line of the ranking table) and prints it on one line together with the filename of the log:

awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log

Here is the format of each log file, from which the number

-9.14

should be extracted:

AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite:          #
#                                                               #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli  #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force      #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021)       #
# DOI 10.1021/acs.jcim.1c00203                                  #
#                                                               #
# O. Trott, A. J. Olson,                                        #
# AutoDock Vina: improving the speed and accuracy of docking    #
# with a new scoring function, efficient optimization and       #
# multithreading, J. Comp. Chem. (2010)                         #
# DOI 10.1002/jcc.21334                                         #
#                                                               #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for  #
# more information.                                             #
#################################################################

Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size  : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1

Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ... 
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
----- ------------ ---------- ----------
   1        -9.14          0          0
   2       -9.109      2.002       2.79
   3       -9.006      1.772      2.315
   4       -8.925          2      2.744
   5       -8.882      3.592      8.189
   6       -8.803      1.564      2.092
   7       -8.507      4.014      7.308
   8        -8.36      2.489      8.193
   9       -8.356      2.529      8.104
  10        -8.33      1.408      3.841

It works OK for a moderate number of input log files (tested with up to 50k logs), but fails for a large number of input logs (e.g. 130k), producing the following error:

./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long

How could I adapt the AWK script so that it can process any number of input logs?

CodePudding user response:

If you get /usr/bin/awk: Argument list too long then you'll have to limit the number of files that you supply to awk per invocation; the standard way to do that efficiently is:

results=. # ??? 
i=00001   # ???
output=   # ???

find "$results" -type f -name "*_rep$i.log" -exec awk '
    FNR == 1 {
        filename = FILENAME
        sub(/.*\//,"",filename)
        sub(/\.[^.]*$/,"",filename)
    }
    $1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
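
Note that the "+" terminator makes find pack as many filenames as fit within the system's argument-length limit into each awk invocation (much like xargs does), so only a few awk processes are spawned no matter how many files match.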

edit: appended the rest of the pipeline, as asked in a comment

note: you might need to supply other predicates to the find command if you don't want it to search the sub-folders of $results recursively
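
For example, with GNU or BSD find (a sketch; -maxdepth is a widespread extension rather than POSIX), restricting the search to $results itself would look like:

find "$results" -maxdepth 1 -type f -name "*_rep$i.log" -exec awk '
    ...
' {} +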

CodePudding user response:

Note that your error message:

./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long

is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
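
To see the limit on your system (getconf is POSIX; the xargs option is GNU-specific):

getconf ARG_MAX                  # max combined size of argv + environment, in bytes
xargs --show-limits < /dev/null  # GNU xargs: prints the limits it will actually use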

Assuming printf is a builtin on your system:

printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
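
For instance, plugged into the script from the question (a sketch; xargs may split the list across several awk invocations, which is fine here because each file is handled independently):

printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}'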

or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:

printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
    NR==FNR {
        ARGV[ARGC++] = $0
        next
    }
    ...
'

If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:

printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
    NR==FNR {
        ARGV[ARGC++] = $0
        next
    }
    ...
' RS='\0' - RS='\n'
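
The trailing assignments take effect in argument order: RS='\0' applies while the NUL-separated file list is read from stdin (-), and RS='\n' restores the normal record separator before the files appended to ARGV are processed.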

CodePudding user response:

When using GNU AWK you can alter ARGC and ARGV to instruct it to read additional files. Consider the following simple example: let the content of filelist.txt be

file1.txt
file2.txt
file3.txt

and let the content of these files be, respectively, uno, dos and tres. Then

awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt

gives the output:

file1.txt uno
file2.txt dos
file3.txt tres

Explanation: while reading the first file, i.e. while the row number within the file (FNR) equals the global row number (NR), I store each line in ARGV under the key NR+1 (ARGV[1] is already filelist.txt) and increase ARGC by 1, then instruct GNU AWK to go to the next line so no other action is taken. For the other files I print the filename followed by the whole line.

(tested in GNU Awk 5.0.1)
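
Applied to the question, the same trick could look like this (a sketch, assuming GNU awk and filenames without newlines; the shell's builtin printf generates the list, so ARG_MAX is never hit):

printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
    FNR==NR { ARGV[ARGC++] = $0; next }   # reading stdin: queue each filename
    $1=="1" {                             # first line of the ranking table
        sub(/.*\//,"",FILENAME)
        sub(/\.log/,"",FILENAME)
        printf "%s: %s\n", FILENAME, $2
    }
' -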

CodePudding user response:

You can also use GNU parallel. Feed it the file list on stdin (printf is a shell builtin, so expanding the glob does not hit ARG_MAX) and let it run awk on each file:

printf '%s\n' "${results}"/*_rep"${i}".log |
parallel -j 8 -q awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}'

This runs the awk command on each log file in parallel, with up to 8 jobs at a time; -q makes parallel pass the awk program through to awk as a single untouched argument.
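
Note that this spawns one awk process per file, which for 130k logs will be slower than the batching approaches above; GNU parallel's -m option (insert as many arguments as the command line permits) batches files per awk call, xargs-style, if throughput matters.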
