sed: pattern search in segmented data

Time:03-11

I am working with data split into several blocks separated by lines of dashes (---), where the ID of each block is given at the beginning of the block's first line

# an example with 4 blocks: 06I, 5p9, Y6J, jacks18

06I: 18 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? THR 26 N       #1.1/A UNL 1 O      #1.1/? THR 26 H      3.515  2.716
 #1.1/? ASN 142 ND2    #1.1/A UNL 1 O      #1.1/? ASN 142 2HD2  3.227  2.305
 #1.1/A UNL 1 N        #1.1/? THR 26 O     #1.1/A UNL 1 H       3.463  2.652
 #1.2/A UNL 1 N        #1.2/? PHE 140 O    #1.2/A UNL 1 H       2.987  2.200
 #1.4/? THR 26 N       #1.4/A UNL 1 S      #1.4/? THR 26 H      4.354  3.371
 #1.4/? HIS 163 NE2    #1.4/A UNL 1 N     no hydrogen            3.137  N/A
 #1.4/A UNL 1 N        #1.4/? ARG 188 O    #1.4/A UNL 1 H       3.000  2.081
 #1.5/? HIS 163 NE2    #1.5/A UNL 1 N     no hydrogen            3.330  N/A
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O      #1.5/? GLN 189 2HE2  3.029  2.132
 #1.6/A UNL 1 N        #1.6/? ARG 188 O    #1.6/A UNL 1 H       2.984  2.064
 #1.8/? ASN 142 ND2    #1.8/A UNL 1 N      #1.8/? ASN 142 2HD2  3.164  2.395
 #1.8/? ASN 142 ND2    #1.8/A UNL 1 O      #1.8/? ASN 142 2HD2  3.031  2.180
 #1.8/? GLN 189 NE2    #1.8/A UNL 1 O      #1.8/? GLN 189 1HE2  3.276  2.553
 #1.8/A UNL 1 N        #1.8/? THR 190 O    #1.8/A UNL 1 H       3.257  2.407
 #1.9/A UNL 1 N        #1.9/? THR 190 O    #1.9/A UNL 1 H       2.913  2.037
 #1.10/? SER 144 OG    #1.10/A UNL 1 S     #1.10/? SER 144 HG   4.246  3.845
 #1.10/? HIS 163 NE2   #1.10/A UNL 1 S    no hydrogen            3.700  N/A
 #1.10/A UNL 1 N       #1.10/? THR 190 O   #1.10/A UNL 1 H      3.008  2.091
-----------------------------------------------------------------------------
5p9: 12 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? GLY 143 N      #1.1/A 5P9 1 O2    #1.1/? GLY 143 H      2.939  2.013
 #1.1/? CYS 145 SG     #1.1/A 5P9 1 N2    #1.1/? CYS 145 HG     3.678  2.679
 #1.1/? CYS 145 SG     #1.1/A 5P9 1 O2    #1.1/? CYS 145 HG     3.584  2.963
 #1.1/? HIS 163 NE2    #1.1/A 5P9 1 O1   no hydrogen            3.307  N/A
 #1.2/? ASN 142 ND2    #1.2/A 5P9 1 N2    #1.2/? ASN 142 2HD2   3.413  2.583
 #1.4/? ASN 142 ND2    #1.4/A 5P9 1 O2    #1.4/? ASN 142 2HD2   3.032  2.290
 #1.5/? GLN 189 NE2    #1.5/A 5P9 1 O1    #1.5/? GLN 189 1HE2   3.546  2.574
 #1.9/? GLY 143 N      #1.9/A 5P9 1 N2    #1.9/? GLY 143 H      3.241  2.345
 #1.9/? GLY 143 N      #1.9/A 5P9 1 O2    #1.9/? GLY 143 H      3.158  2.273
 #1.9/? GLN 189 NE2    #1.9/A 5P9 1 O1    #1.9/? GLN 189 1HE2   3.265  2.561
 #1.10/? ASN 142 ND2   #1.10/A 5P9 1 O2   #1.10/? ASN 142 2HD2  3.080  2.518
 #1.11/? ASN 142 ND2   #1.11/A 5P9 1 O2   #1.11/? ASN 142 1HD2  2.942  2.261
-----------------------------------------------------------------------------
Y6J: 19 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? SER 144 OG    #1.1/A UNL 1 S       #1.1/? SER 144 HG    4.242  3.841
 #1.1/? HIS 163 NE2   #1.1/A UNL 1 S      no hydrogen            3.869  N/A
 #1.1/? GLN 189 NE2   #1.1/A UNL 1 O       #1.1/? GLN 189 1HE2  3.192  2.191
 #1.1/? GLN 189 NE2   #1.1/A UNL 1 O       #1.1/? GLN 189 2HE2  3.129  2.463
 #1.2/? GLN 189 NE2   #1.2/A UNL 1 O       #1.2/? GLN 189 1HE2  3.244  2.245
 #1.2/? GLN 189 NE2   #1.2/A UNL 1 O       #1.2/? GLN 189 2HE2  3.145  2.414
 #1.3/? GLN 189 NE2   #1.3/A UNL 1 O       #1.3/? GLN 189 1HE2  2.980  2.036
 #1.4/? GLY 143 N     #1.4/A UNL 1 S       #1.4/? GLY 143 H     3.989  3.296
 #1.4/? SER 144 N     #1.4/A UNL 1 S       #1.4/? SER 144 H     3.910  3.194
 #1.4/? GLN 189 NE2   #1.4/A UNL 1 O       #1.4/? GLN 189 1HE2  3.153  2.331
 #1.5/? HIS 163 NE2   #1.5/A UNL 1 S      no hydrogen            3.901  N/A
 #1.5/? GLN 189 NE2   #1.5/A UNL 1 O       #1.5/? GLN 189 1HE2  3.161  2.580
 #1.5/A UNL 1 N       #1.5/? GLU 166 OE2   #1.5/A UNL 1 H       3.147  2.198
 #1.6/? GLY 143 N     #1.6/A UNL 1 N       #1.6/? GLY 143 H     3.145  2.243
 #1.6/? GLN 189 NE2   #1.6/A UNL 1 O       #1.6/? GLN 189 1HE2  2.985  2.119
 #1.6/A UNL 1 N       #1.6/? GLU 166 OE1   #1.6/A UNL 1 H       2.974  2.005
 #1.7/? GLY 143 N     #1.7/A UNL 1 S       #1.7/? GLY 143 H     3.841  2.976
 #1.8/A UNL 1 N       #1.8/? PHE 140 O     #1.8/A UNL 1 H       2.937  2.062
 #1.10/? GLY 143 N    #1.10/A UNL 1 O      #1.10/? GLY 143 H    3.182  2.150
-----------------------------------------------------------------------------
jacks18: 11 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? HIS 163 NE2    #1.1/A V1B 1 N3      no hydrogen            3.189  N/A
 #1.2/? ASN 142 ND2    #1.2/A V1B 1 O2       #1.2/? ASN 142 1HD2   3.089  2.515
 #1.4/? ASN 142 ND2    #1.4/A V1B 1 O2       #1.4/? ASN 142 2HD2   3.258  2.631
 #1.4/? GLY 143 N      #1.4/A V1B 1 N3       #1.4/? GLY 143 H      3.143  2.116
 #1.5/? GLN 189 NE2    #1.5/A V1B 1 O2       #1.5/? GLN 189 1HE2   3.087  2.354
 #1.6/? ASN 142 ND2    #1.6/A V1B 1 O2       #1.6/? ASN 142 2HD2   3.093  2.110
 #1.7/? GLN 189 NE2    #1.7/A V1B 1 O2       #1.7/? GLN 189 1HE2   3.031  2.322
 #1.7/A V1B 1 N1       #1.7/? GLU 166 OE1    #1.7/A V1B 1 H        2.983  2.094
 #1.9/? ASN 142 ND2    #1.9/A V1B 1 O1       #1.9/? ASN 142 2HD2   3.071  2.214
 #1.10/? ASN 142 ND2   #1.10/A V1B 1 O2      #1.10/? ASN 142 2HD2  3.108  2.171
 #1.11/A V1B 1 N1      #1.11/? GLU 166 OE2   #1.11/A V1B 1 H       3.355  2.549
-----------------------------------------------------------------------------

Within each block I need to find a specified pattern corresponding to the number (like 26, 142, 140 etc.) that follows the three-letter code (like ASN). Basically, I need information about the first occurrence of the pattern in each block, plus the number of matching lines. The expected output should include the value from the penultimate column of the first line on which the specified number is found. E.g. given my input with 4 blocks, for "142", I should obtain:

06I: the first occurrence of 142 is (3.227). #142 found 3 times
5p9: the first occurrence of 142 is (3.413). #142 found 4 times
Y6J: (). 142 found 0 times.
jacks18: the first occurrence of 142 is (3.089). #142 found 5 times

I can use sed to print all lines containing the pattern:

pattern='142'
sed -n "/${pattern}/p" input.log

Alternatively, I can use sed to print only the line with the first occurrence of the pattern:

sed -n "/${pattern}/p; /${pattern}/q" input.log

Could you suggest a way to adapt these commands to the multi-block structure of the file, so that they print the name of each block and the occurrences of the pattern in the format shown above?

CodePudding user response:

Assumptions:

  • a block 'name' always occurs at the beginning of the line, contains no white space, and is terminated with a :
  • no other lines start with what could erroneously be considered a block 'name' (ie, no other lines start with ^<alphanumeric>:)
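
Before relying on these assumptions, it may be worth listing the lines that would be treated as block names. The following is only a sketch: `sample.log` is a hypothetical, trimmed excerpt of the data from the question (the block headers keep their original H-bond counts), standing in for the real `input.log`.

```shell
# Build a small sample in the same layout as the question's data (hypothetical file name).
cat > sample.log <<'EOF'
06I: 18 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? THR 26 N       #1.1/A UNL 1 O      #1.1/? THR 26 H      3.515  2.716
-----------------------------------------------------------------------------
5p9: 12 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? GLY 143 N      #1.1/A 5P9 1 O2    #1.1/? GLY 143 H      2.939  2.013
-----------------------------------------------------------------------------
EOF

# Print the line number and first field of every line whose first field is
# made up of alphanumeric characters followed by a single trailing colon.
names=$(awk '$1 ~ /^[[:alnum:]]+:$/ { print NR ": " $1 }' sample.log)
printf '%s\n' "$names"
```

If anything other than the expected block IDs shows up in this list, the second assumption does not hold for that file and the block-name test would need tightening.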

One awk idea:

awk -v ptn="ASN 142" '                                           # define pattern to search for

function print_findings() {

    if (block)                                                   # if block is non-empty
       if (count)                                                # if count is non-zero
          printf "%s the first occurrence of %s is (%s). #%s found %d times\n", block, ptn, value, ptn, count
       else                                                      # else count=0
          printf "%s (). %s found 0 times.\n", block, ptn
}

$1 ~ /^[[:alnum:]]+:$/ { print_findings()                        # flush previous block details
                         block=$1                                # grab new block name
                         count=0                                 # reset
                         value=""                                # reset
                         next
                       }

$0 ~ ptn               { count++                                 # if we find ptn then increment counter and ...
                         value= (value == "") ? $(NF-1) : value  # save the value on the first matching line
                       }

END                    { print_findings() }                      # flush last block details
' input.log

NOTE: remove comments to declutter code

For -v ptn="ASN 142" this generates:

06I: the first occurrence of ASN 142 is (3.227). #ASN 142 found 3 times
5p9: the first occurrence of ASN 142 is (3.413). #ASN 142 found 4 times
Y6J: (). ASN 142 found 0 times.
jacks18: the first occurrence of ASN 142 is (3.089). #ASN 142 found 5 times

For -v ptn="GLN 189" this generates:

06I: the first occurrence of GLN 189 is (3.029). #GLN 189 found 2 times
5p9: the first occurrence of GLN 189 is (3.546). #GLN 189 found 2 times
Y6J: the first occurrence of GLN 189 is (3.192). #GLN 189 found 8 times
jacks18: the first occurrence of GLN 189 is (3.087). #GLN 189 found 2 times

For -v ptn="GLU 166" this generates:

06I: (). GLU 166 found 0 times.
5p9: (). GLU 166 found 0 times.
Y6J: the first occurrence of GLU 166 is (3.147). #GLU 166 found 2 times
jacks18: the first occurrence of GLU 166 is (2.983). #GLU 166 found 2 times
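
The question asks to search by the residue number alone (e.g. "142"), whatever three-letter code precedes it. One possible variant of the same awk idea, offered as a sketch rather than a tested solution for the full file: it compares whole fields instead of matching a regex against the line, so "142" cannot match inside a longer number such as "1420". The variable `num`, the function `report` and the file `sample.log` (a trimmed excerpt of the question's data) are names introduced here for illustration, and the script assumes the residue name is always exactly three uppercase letters.

```shell
# Trimmed excerpt of the question's data (hypothetical file name).
cat > sample.log <<'EOF'
06I: 18 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? THR 26 N       #1.1/A UNL 1 O      #1.1/? THR 26 H      3.515  2.716
 #1.1/? ASN 142 ND2    #1.1/A UNL 1 O      #1.1/? ASN 142 2HD2  3.227  2.305
 #1.8/? ASN 142 ND2    #1.8/A UNL 1 N      #1.8/? ASN 142 2HD2  3.164  2.395
-----------------------------------------------------------------------------
Y6J: 19 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? HIS 163 NE2   #1.1/A UNL 1 S      no hydrogen            3.869  N/A
-----------------------------------------------------------------------------
EOF

out=$(awk -v num="142" '
function report() {                          # print the summary for the current block
    if (block == "") return                  # nothing seen yet
    if (count) { printf "%s the first occurrence of %s is (%s). #%s found %d times\n", block, num, value, num, count }
    else       { printf "%s (). %s found 0 times.\n", block, num }
}
$1 ~ /^[[:alnum:]]+:$/ { report(); block = $1; count = 0; value = ""; next }
{
    for (i = 2; i <= NF; i++)                # look for <three-letter code> <num> as adjacent fields
        if (($i "") == num && $(i-1) ~ /^[A-Z][A-Z][A-Z]$/) {
            if (count++ == 0) value = $(NF-1)    # keep the penultimate column of the first hit
            break                                # count each line once, even if the residue appears twice
        }
}
END { report() }
' sample.log)
printf '%s\n' "$out"
```

On this excerpt the first block reports two matching lines with 3.227 as the first value, and the second block reports zero matches. Note that a ligand name made of three uppercase letters (such as UNL followed by 1) would also satisfy the field test, so searching for "1" would count ligand fields as well.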