Extract data from sdf file in bash environment

Time:05-13

I want to extract data from an SDF file.

I want to save the > <Name> and > <SCORE.INTER> values in a .tsv file. Is there a quick way to do this, e.g. with awk? Thanks in advance.

The SDF file consists of thousands of blocks. One block of the file looks like this:

ZINC000169748276

 38 39  0  0  0  0  0  0  0  0999 V2000
   11.2318    3.6419   22.3134 C   0  0  0  0  0  0
   12.5621    3.7685   22.2617 C   0  0  0  0  0  0
   13.0725    5.1806   22.3121 C   0  0  0  0  0  0
   10.8850    6.0303   22.4462 C   0  0  0  0  0  0
   13.4310    2.6268   22.1614 C   0  0  0  0  0  0
   12.9848    1.3691   22.0592 C   0  0  0  0  0  0
    8.2548    4.7608   21.1375 C   0  0  0  0  0  0
    7.1479    3.7322   21.1132 C   0  0  0  0  0  0
    7.7728    2.5366   21.8185 C   0  0  0  0  0  0
    8.9539    4.4605   22.4534 C   0  0  0  0  0  0
   13.8873    0.1824   21.9500 C   0  0  0  0  0  0
    8.5117    1.6060   20.8656 C   0  0  0  0  0  0
   12.2544    6.2009   22.3970 N   0  0  0  0  0  0
   10.3635    4.7178   22.4055 N   0  0  0  0  0  0
   14.4254    5.4429   22.2718 N   0  0  0  0  0  0
   13.7646   -0.5167   20.6443 N   0  3  0  0  0  0
    6.5529   -4.6019   19.9460 O   0  5  0  0  0  0
    8.2203   -4.0310   21.8048 O   0  5  0  0  0  0
    6.8149    1.6459   17.3793 O   0  5  0  0  0  0
    5.4231   -2.1179   18.5726 O   0  5  0  0  0  0
   10.1403    7.0090   22.5243 O   0  0  0  0  0  0
    5.7155   -3.6365   22.1679 O   0  0  0  0  0  0
    5.6431    1.8811   19.7228 O   0  0  0  0  0  0
    5.0295   -0.6218   20.7059 O   0  0  0  0  0  0
    8.7342    3.0736   22.7475 O   0  0  0  0  0  0
    6.0324    4.2091   21.8626 O   0  0  0  0  0  0
    8.1857    1.9631   19.5323 O   0  0  0  0  0  0
    7.0232   -2.2197   20.5667 O   0  0  0  0  0  0
    7.0081   -0.1966   19.1450 O   0  0  0  0  0  0
    6.8632   -3.7464   21.1697 P   0  0  0  0  0  0
    6.7991    1.4009   18.8725 P   0  0  0  0  0  0
    5.9605   -1.3044   19.7288 P   0  0  0  0  0  0
   15.0444    4.6730   22.2089 H   0  0  0  0  0  0
   14.7148    6.3890   22.3078 H   0  0  0  0  0  0
   14.3405   -1.3642   20.6292 H   0  0  0  0  0  0
   14.0706    0.0896   19.8769 H   0  0  0  0  0  0
   12.7928   -0.7891   20.4667 H   0  0  0  0  0  0
    5.3352    3.5319   21.8055 H   0  0  0  0  0  0
  1  2  2  0  0  0
  1 14  1  0  0  0
  2  3  1  0  0  0
  2  5  1  0  0  0
  3 13  2  0  0  0
  3 15  1  0  0  0
  4 13  1  0  0  0
  4 14  1  0  0  0
  4 21  2  0  0  0
  5  6  2  0  0  0
  6 11  1  0  0  0
  7  8  1  0  0  0
  7 10  1  0  0  0
  8  9  1  0  0  0
  8 26  1  0  0  0
  9 12  1  0  0  0
  9 25  1  0  0  0
 10 14  1  0  0  0
 10 25  1  0  0  0
 11 16  1  0  0  0
 12 27  1  0  0  0
 17 30  1  0  0  0
 18 30  1  0  0  0
 19 31  1  0  0  0
 20 32  1  0  0  0
 22 30  2  0  0  0
 23 31  2  0  0  0
 24 32  2  0  0  0
 27 31  1  0  0  0
 28 30  1  0  0  0
 28 32  1  0  0  0
 29 31  1  0  0  0
 29 32  1  0  0  0
 15 33  1  0  0  0
 15 34  1  0  0  0
 16 35  1  0  0  0
 16 36  1  0  0  0
 16 37  1  0  0  0
 26 38  1  0  0  0
M  END
>  <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579

>  <Name>
ZINC000169748276

>  <RI>
1.76083e+07


>  <Rbt.Executable>
rbdock/0.1.0

>  <Rbt.Library>
librxdock.so/0.1.0

>  <SCORE>
-41.7582

>  <SCORE.INTER>
-41.8551

>  <SCORE.INTER.CONST>
1

>  <SCORE.INTER.POLAR>
-4.96496

>  <SCORE.INTER.REPUL>
0

>  <SCORE.INTER.ROT>
10

>  <SCORE.INTER.VDW>
-40.3742

>  <SCORE.INTER.norm>
-1.30797

>  <SCORE.INTRA>
0.0969082

>  <SCORE.INTRA.DIHEDRAL>
-5.79141

>  <SCORE.INTRA.DIHEDRAL.0>
19.5819

>  <SCORE.INTRA.POLAR>
0

>  <SCORE.INTRA.POLAR.0>
0

>  <SCORE.INTRA.REPUL>
0

>  <SCORE.INTRA.REPUL.0>
0

>  <SCORE.INTRA.VDW>
2.99261

>  <SCORE.INTRA.VDW.0>
-5.2787

>  <SCORE.INTRA.norm>
0.00302838

>  <SCORE.RESTR>
0

>  <SCORE.RESTR.CAVITY>
0

>  <SCORE.RESTR.norm>
0

>  <SCORE.SYSTEM>
0

>  <SCORE.SYSTEM.CONST>
0

>  <SCORE.SYSTEM.DIHEDRAL>
0

>  <SCORE.SYSTEM.norm>
0

>  <SCORE.heavy>
32

>  <SCORE.norm>
-1.30494

$$$$

The .tsv file should look like this:

ZINC000169748276    -41.8551
ZINC000079214514    -41.7892
ZINC000195993528    -40.9293

CodePudding user response:

Using any awk:

$ awk -v OFS='\t' '
    /^>/ { tag=$2; next }
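    $0 == "$$$$" { print f["<Name>"], f["<SCORE.INTER>"]; next }   # end of record: print, and do not store $$$$ as a value
    NF { f[tag]=$1 }                                                # any other non-blank line is the value of the current tag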
' file
ZINC000169748276        -41.8551

The above assumes a line containing $$$$ is what's used to separate your input records.
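
A caveat with the script above: f[] is never cleared, so a record that lacks a tag would silently report the value left over from the previous record, and $1 keeps only the first whitespace-separated field of a value. The following is a minimal sketch of one way to guard against both, assuming (as in the block shown in the question) that a blank line ends each data item. The function name sdf_to_tsv is made up for illustration.

```shell
# Hypothetical helper: print Name and SCORE.INTER for every record as TSV.
# Reads the files given as arguments, or stdin when there are none.
sdf_to_tsv() {
    awk -v OFS='\t' '
        /^>/ { tag = $2; next }            # data header line, e.g. ">  <Name>"
        !NF  { tag = ""; next }            # a blank line ends the current data item
        $0 == "$$$$" {                     # end of record
            print f["<Name>"], f["<SCORE.INTER>"]
            split("", f)                   # portable way to empty the array
            tag = ""
            next
        }
        tag != "" { f[tag] = $0 }          # whole line; a later line of the same item overwrites it
    ' "$@"
}
```

It could be called as sdf_to_tsv file.sdf > scores.tsv (both file names are placeholders). A record without a SCORE.INTER entry then yields an empty second column instead of a stale value.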

Note that this approach of first building an array (f[] above) that maps the tags/names to their values lets you print whatever values you like in whatever order you like, convert the whole thing to a CSV, compare values with other values by their names, etc. For example, you can write things like this to analyze areas of your data and output reports:

awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    $0 == "$$$$" {
        if (    (f["<SCORE.INTRA.POLAR>"] >= f["<SCORE.INTRA.REPUL>"]) &&
                (f["<SCORE.RESTR.CAVITY>"] == 27) ) {
            print f["<Name>"]
            for ( tag in f ) {
                if ( tag ~ /SCORE/ ) {
                    print f[tag]
                }
            }
        }
        next    # do not store the $$$$ line itself as a value
    }
    NF { f[tag]=$1 }
' file

If you're ever considering using getline then please see http://awk.freeshell.org/AllAboutGetline for why it's usually the wrong approach.

CodePudding user response:

Why awk?

Prompt> grep -A 1 -i "<NAME>" test.txt | tail -n 1
ZINC000169748276
Prompt> grep -A 1 -i "<SCORE.INTER>" test.txt | tail -n 1
-41.8551

As you see, grep is far easier.

-A 1 means "also take the next 1 line(s)".

After some discussion, this is the final solution:

grep -A 1 -i "<SCORE.INTER>" test.sdf | grep -v '^>' | grep -v '^--' >> results
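
The command above collects only the scores. A rough sketch of how the same grep -A idea could pair each name with its score is given here, assuming every record contains exactly one > <Name> and one > <SCORE.INTER> entry, in that order, and that your grep supports -A (a GNU/BSD extension, not POSIX). The function name sdf_grep_tsv is hypothetical.

```shell
# Hypothetical helper: Name<TAB>SCORE.INTER per record, using grep and paste.
# Reads one file given as argument, or stdin when there is none.
sdf_grep_tsv() {
    # 1) keep each matching tag line plus the line after it
    # 2) drop the tag lines and the "--" group separators that -A adds
    # 3) join every two remaining lines with a tab
    grep -A 1 -E '^> *<(Name|SCORE\.INTER)>' "$@" |
        grep -v -e '^>' -e '^--$' |
        paste - -
}
```

Because the pairing is purely positional, a record that is missing one of the two entries shifts every following line, so this is only safe on files where both entries are always present.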

CodePudding user response:

I want to save the > <NAME> and > <SCORE.INTER> values in a .tsv file. Is there any way for a quick solution e.g. via awk?

Your file has > <Name>, not > <NAME> (an important difference if you match case-sensitively). I would use GNU AWK for this task in the following way (this assumes > <Name> always comes before > <SCORE.INTER> and each > <SCORE.INTER> has a corresponding > <Name>). Let file.txt content be

(the same single-record block as shown in the question, from the ZINC000169748276 header line through $$$$)

then

awk '/^>  <Name>/{getline;printf "%s\t",$0}/^>  <SCORE\.INTER>/{getline;print $0}' file.txt

output

ZINC000169748276    -41.8551

Explanation: getline makes GNU AWK load the next line, so $0 becomes the content of the line after the current one. When > <Name> is encountered at the start of a line (^), the next line is loaded and printed followed by a TAB; for a line starting with > <SCORE.INTER>, the next line is loaded and printed. Note that . needs to be escaped because it has a special meaning in regular expressions.

(tested in gawk 4.2.1)
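
For comparison, the same "take the line after the tag" logic can be sketched without getline by remembering the previous line. This is only an illustration under the same assumption as above (> <Name> comes before > <SCORE.INTER> within each record). The function name sdf_prev_line is made up.

```shell
# Hypothetical helper: Name<TAB>SCORE.INTER per record, without getline.
# Reads the files given as arguments, or stdin when there are none.
sdf_prev_line() {
    awk '
        prev ~ /^> *<Name>/         { name = $0 }            # line right after the Name tag
        prev ~ /^> *<SCORE\.INTER>/ { print name "\t" $0 }   # line right after the score tag
        $0 == "$$$$"                { name = "" }            # forget the name at the end of a record
        { prev = $0 }                                        # remember this line for the next one
    ' "$@"
}
```

With this variant a record that has no > <SCORE.INTER> entry simply produces no output line.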
