awk: processing log and search pattern-CodePudding

I am working with the log filles arranged in the following format:

fƒdfFinding intramodel H-bonds
Constraints relaxed by 0.5 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.11 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.12 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.13 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.14 SarsCov2_structure49R_nsp5holo_rep1.pdb

14 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 ND2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 1HD2   3.102  2.145
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 H      3.011  2.024
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 H      3.037  2.132
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? HIS 163 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   no hydrogen                                                   3.388  N/A
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 H      2.806  1.792
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 N      SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 H       3.093  2.142
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 H      3.030  2.193
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 2HE2   3.052  2.301
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 H     2.854  1.868
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 H     3.103  2.070
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 H     3.161  2.224
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 SG   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 HG    3.421  2.842
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 ND2  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 2HD2  3.055  2.465
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 H     2.924  2.143

I need to find the first occurence of the "GLU 166 N" pattern and print the number present on the same line just before the pattern as #1.number/?, associated with this pattern. So in the example the detected number should be 3 (since the associating number is #1.3/?).

I would start from basic pattern-detection

awk '/GLU 166 N/' file

but how to find correctly the number defined just before the pattern and print it as output ? Finally, in the case if the pattern can not be found, I would like that the script prints 1.

CodePudding user response：

$ awk -vn=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' file
3
$ awk -vn=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' /dev/null
1

What you look for is in the second field ($2). gsub(/.*\.|\/\?/,"",$2) replaces in $2 all leading characters up to (and including) the period, and the trailing /? by the empty string.

CodePudding user response：

Using GNU awk for the 3rd arg to match():

$ awk 'match($0,/([0-9] ).. GLU 166 N /,a){print a[1]; exit}' file
3

or using any awk:

$ awk 'match($0,/[0-9] .. GLU 166 N /){sub("/.*",""); print substr($0,RSTART); exit}' file
3

$ awk 'match($0,/[0-9] .. GLU 166 N /){print substr($0,RSTART,RLENGTH-13); exit}' file
3

CodePudding user response：

If GNU awk which supports gensub function is available, would you please try:

awk '/GLU 166 N/ {
    print gensub(/^.*#1\.([0-9] )\/\? GLU 166 N.*$/, "\\1", 1)
    exit
}'  file

The regex ^.*#1\\.([0-9] )/\\? GLU 166 N.*$ matches the line with the substring #1.<number>/? "GLU 166 N. The <number> portion, which is enclosed with the parentheses in the regex as ([0-9] ) is captured as group 1, then the entire line is replaced with the group 1, which is specified as the replacement \\1, then it is printed as the result.
Alternatively you can say with GNU sed as:

sed -nE '0,/GLU 166 N/s|^.*#1\.([0-9] )/\? GLU 166 N.*|\1|p' file

The address 0,/pattern/, where 0 is specific to GNU sed as a starting line, makes the script exit immediately after the 1st pattern match.