awk: searching a log for multiple patterns

I am analysing data spread across many separate log files. Each log has the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.6 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.2 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.3 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.4 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.5 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.6 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.7 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.8 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.9 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.10 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.11 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.12 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.13 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.14 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.15 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.16 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.17 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.18 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.19 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.20 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.21 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.22 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.23 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.24 SarsCov2_mol30_nsp5holoHIE_rep1.pdb

17 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? HIE 163 HE2    3.250  2.448
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? GLU 166 H      2.817  2.027
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 N      SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 N2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 H       3.453  2.470
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 HE2    3.269  2.495
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 H      3.555  2.634
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/? GLU 166 H      3.622  2.743
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/? GLU 166 H      2.797  1.790
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/? GLU 166 H     3.780  2.783
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/? GLU 166 H     3.273  2.541
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/? HIE 163 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/? HIE 163 HE2   3.389  2.556
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? ASN 142 ND2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? ASN 142 2HD2  3.067  2.303
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? GLY 143 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/A LIG 888 N2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? GLY 143 H     2.962  2.016
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/? GLU 166 H     2.926  1.930
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/? GLN 189 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/? GLN 189 1HE2  3.026  2.212
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? GLY 143 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? GLY 143 H     2.855  1.848
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? HIE 163 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? HIE 163 HE2   3.345  2.400
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/? GLN 189 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/? GLN 189 1HE2  2.893  2.286

I need to consider every line that comes after the header string

H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):

In the remaining lines I need to check whether the three keywords

GLU 166
HIE 163
THR 26

are all present under the same index (1.1, 1.2 ... 1.24, taken from the second column), and then print the name of the log together with that index. In the log above the index is 1.2, since the three keywords appear in these lines:

SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 N      SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 N2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 H       3.453  2.470
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 HE2    3.269  2.495
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 H      3.555  2.634

so the expected output should be:

log_name.log the patterns are found in the #1.2!

I've tried looping over the logs in a simple bash workflow, with awk code that looks for one pattern, but I could not extend it to three patterns that belong to the same index #:

for log in /logs/*hbondsALL_rep"${i}".log; do
  log_name=$(basename "$log" .log | cut -d'_' -f 2)
  # search only one pattern GLU 166
  i=$(awk -vn=1 '/GLU 166/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' $log)
  # insert here alternative search solution which check the patterns!
  # and find the index {i} in the log
  # log_name.log the patterns are found in the # {i} 
done

Can I use sed or awk for such a pattern-based search integrated in bash?

CodePudding user response：

Assuming (index,keyword) pairs are unique within each file, a quick-and-dirty solution using bash and standard utilities would be like this:

#!/bin/bash

for log in *.log; do
    idx=$(sed -nE '/^H-bonds/,$s/.*(#[0-9.]+).. (GLU|HIE|THR) .*/\1/p' "$log" |
          sort |
          uniq -c |
          sed -n 's/^[[:blank:]]*3[[:blank:]]\(.*\)/\1/p')
    if [[ $idx ]]; then
        printf '%s: the patterns found in the %s\n' "$log" "${idx//$'\n'/' '}"
    fi
done

Edit: It turned out that the assumption above was false. The version below should work without that assumption.

#!/bin/bash

for log in *.log; do
    idx=$(sed -nE '/^H-bonds/,$s/[^ ]* (#[0-9.]+)\/\? (GLU|HIE|THR) .*/\1 \2/p' "$log" |
          sort -u |
          sed 's/ .*//' |
          uniq -c |
          sed -n 's/^[[:blank:]]*3[[:blank:]]\(.*\)/\1/p')
    if [[ $idx ]]; then
        printf '%s: the patterns found in the %s\n' "$log" "${idx//$'\n'/' '}"
    fi
done
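
To see why `sort -u` matters, here is a minimal sketch that skips the first sed stage and feeds the counting stages a hand-written list of (index, residue) pairs. The pairs and the helper name `count_hits` are made up for illustration; "#1.1 GLU" is deliberately listed twice, so the uniqueness assumption does not hold.

```shell
# Hypothetical (index, residue) pairs, as the first sed stage would emit them.
# "#1.1 GLU" appears twice, so the pairs are NOT unique.
pairs=$(printf '%s\n' '#1.1 GLU' '#1.1 GLU' '#1.1 HIE' \
                      '#1.2 THR' '#1.2 HIE' '#1.2 GLU')

# Keep only the index, count lines per index, print indexes counted exactly 3 times.
count_hits() {
    sed 's/ .*//' | uniq -c | sed -n 's/^[[:blank:]]*3[[:blank:]]\(.*\)/\1/p'
}

naive=$(printf '%s\n' "$pairs" | sort    | count_hits)   # counting as in the first version
dedup=$(printf '%s\n' "$pairs" | sort -u | count_hits)   # counting as in the second version

printf 'sort:    %s\n' "$(printf '%s' "$naive" | tr '\n' ' ')"
printf 'sort -u: %s\n' "$(printf '%s' "$dedup" | tr '\n' ' ')"
```

With plain `sort` the duplicated GLU line pushes #1.1 to a count of 3, so both #1.1 and #1.2 are reported; with `sort -u` the duplicate collapses and only #1.2 remains.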

Edit 2: An awk version could be:

for logfile in *.log; do
    awk -F '[ /]' '
        /^H-bonds/ { f=1; next }
        f { a[$2$4]
            if ( ($2 "GLU" in a) && ($2 "HIE" in a) && ($2 "THR" in a) ) {
                print FILENAME ": the patterns are found in the " $2
            }
          }
    ' "$logfile"
done

CodePudding user response：

Using the example log you gave above, and somewhat simplistic assumptions:
If what you want is any file in which any of the patterns match, here's one solution:

$: awk '/ GLU 166 | HIE 163 | THR 26 /{ gsub("[/?]+","!",$2); print FILENAME" the patterns are found in the "$2; exit; }' log
log the patterns are found in the #1.1!

If you only want files and indexes where ALL the patterns are found on a given index, that's a little trickier, but still not so bad.

$: awk '/ GLU 166 | HIE 163 | THR 26 / { 
          if ( hit[$2] ~ " "$3$4" " ) { next; } else {  hit[$2]=" "hit[$2]$3$4" " } }
        END{ for (ndx in hit) {
          if ( hit[ndx] ~ / GLU166 / && hit[ndx] ~ / HIE163 / && hit[ndx] ~ / THR26 / ) { 
            gsub("[/?]+","!",ndx); 
            print FILENAME" the patterns are found in the "ndx } } }' log
log the patterns are found in the #1.2!

This will also be much faster on log files of any size than a shell script, as all the logic is encapsulated in a single invocation of awk. If the files are really big, we could add several streamlining improvements.
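
As a rough sketch of that single-invocation idea, the function below passes any number of logs to one awk process and tracks hits per (file, index) pair, so nothing depends on `END` or on which file was read last. The name `find_full_indexes` and the array names are hypothetical, and the three residues are hard-coded; it assumes the donor residue sits in fields 3 and 4, as in the sample log.

```shell
# Hypothetical helper: report every (file, index) whose donor column contains
# all three residues. All logs are handled by a single awk process.
find_full_indexes() {
    awk '
        FNR == 1   { in_table = 0 }                 # new file: wait for its H-bonds header
        /^H-bonds/ { in_table = 1; next }           # the table starts after this line
        in_table && NF >= 4 {
            res = $3 " " $4                         # donor residue, e.g. "GLU 166"
            if (res == "GLU 166" || res == "HIE 163" || res == "THR 26") {
                idx = $2
                sub(/\/.*/, "", idx)                # "#1.2/?" -> "#1.2"
                # count each residue once per (file, index); report on the third one
                if (!seen[FILENAME, idx, res]++ && ++n[FILENAME, idx] == 3)
                    print FILENAME " the patterns are found in the " idx "!"
            }
        }
    ' "$@"
}
```

It could be called as `find_full_indexes /logs/*.log`. Because the counters are keyed by file name and index, repeated lines for the same residue are counted only once and the lines of one index do not need to be adjacent.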

With a little more work we could factor the search patterns out to another file, for example. :)

CodePudding user response：

Assumptions:

within a file an index/keyword pair (eg, #1.2 / GLU 166) is unique
within a file all lines of interest are sorted by index (eg, #1.1 before #1.2 before #1.3 ...)

One awk idea that allows the user to supply a list of keywords via a bash variable:

keywords='GLU 166,HIE 163,THR 26'

awk -v keywords="${keywords}" '

function print_match() {

    if (found_cnt == key_cnt) {                        # if all keys were found then ...
       split(ndx,a,"/")                                # strip off "/..." from ndx and print message:
       print FILENAME,"the patterns are found in the",a[1] "!"
    #  found_hb=0                                      # uncomment to  print only the first matching index in a file
    }
    found_cnt=0
}

BEGIN      { key_cnt=split(keywords,a,",")             # parse input parameter "keywords"
             for (i=1;i<=key_cnt;i++)                  # convert to an associative array where keys are the array indices
                 keys[a[i]]
           }

FNR==1     { print_match()                             # new file? flush previous index details and ... 
             found_hb=0                                # disable testing for keywords
           }
/^H-bonds/ { found_hb=1; next }                        # enable testing for keywords?
found_hb   { if ($2 != ndx) {                          # if this is a new index then ...
                print_match()                          # flush previous index details and ...
                ndx=$2                                 # make note of new index
             }
             key=$3 FS $4
             if (key in keys)                          # if current key is an index in our associative array keys[] then ...
                found_cnt++                            # increment our count
           }
END        { print_match() }                           # flush last index details
' test.log

NOTE: for a large number of keywords (to search for) I'd probably opt for storing them in a file which in turn would require a few tweaks of this code to load said file (of keywords) into the keys[] associative array, but that's for another day and a different Q&A session ...
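
One way the keyword-file variant might look is sketched below. It is not a drop-in patch of the script above: it counts per (file, index) instead of relying on sorted input, and the function name `match_from_file`, the `hits[]`/`seen[]` arrays and the one-pair-per-line file layout are all assumptions made for the example. The keyword file is read first via the usual `NR == FNR` idiom and must not be empty.

```shell
# Hypothetical helper: the first argument is a keyword file with one
# "RESNAME RESNUM" pair per line (e.g. "GLU 166"); the rest are log files.
match_from_file() {
    keyfile=$1
    shift
    awk '
        NR == FNR {                                 # first file: load the keyword list
            if (NF >= 2) {
                k = $1 " " $2
                if (!(k in keys)) { keys[k]; key_cnt++ }
            }
            next
        }
        FNR == 1   { in_table = 0 }                 # new log: wait for its H-bonds header
        /^H-bonds/ { in_table = 1; next }
        in_table && NF >= 4 {
            k = $3 " " $4                           # donor residue of this line
            if (k in keys) {
                idx = $2
                sub(/\/.*/, "", idx)                # "#1.2/?" -> "#1.2"
                # report once, when the last missing keyword for this index shows up
                if (!seen[FILENAME, idx, k]++ && ++hits[FILENAME, idx] == key_cnt)
                    print FILENAME, "the patterns are found in the", idx "!"
            }
        }
    ' "$keyfile" "$@"
}
```

A call such as `match_from_file keywords.txt test.log` would then take the place of the `-v keywords=...` parameter; `keywords.txt` is a placeholder name.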

Taking for a test drive ...

When keywords='GLU 166,HIE 163,THR 26':

test.log the patterns are found in the #1.2!

When keywords='GLU 166,HIE 163':

test.log the patterns are found in the #1.1!
test.log the patterns are found in the #1.2!           # does not print if the 'found_hb=0' line is uncommented in the print_match() function

When keywords='ASN 142,GLY 143':

test.log the patterns are found in the #1.15!

When keywords='ASN 142,HIE 163':

           <<<=== no output

Making some copies of the log file:

$ cp test.log test.log2
$ cp test.log test.log3

Feeding all 3 log files to the awk script:

awk -v keywords="${keywords}" '
function print_match() {
... snip ...
END        { print_match() }
' test.log test.log2 test.log3

And running with keywords='GLU 166,HIE 163':

test.log the patterns are found in the #1.1!
test.log the patterns are found in the #1.2!
test.log2 the patterns are found in the #1.1!
test.log2 the patterns are found in the #1.2!
test.log3 the patterns are found in the #1.1!
test.log3 the patterns are found in the #1.2!