I want to extract data from an SDF file.
I want to save the > <Name> and > <SCORE.INTER> values in a .tsv file.
Is there a quick way to do this, e.g. with awk?
Thanks in advance.
The SDF file consists of thousands of blocks. One block of the file looks like this:
ZINC000169748276
38 39 0 0 0 0 0 0 0 0999 V2000
11.2318 3.6419 22.3134 C 0 0 0 0 0 0
12.5621 3.7685 22.2617 C 0 0 0 0 0 0
13.0725 5.1806 22.3121 C 0 0 0 0 0 0
10.8850 6.0303 22.4462 C 0 0 0 0 0 0
13.4310 2.6268 22.1614 C 0 0 0 0 0 0
12.9848 1.3691 22.0592 C 0 0 0 0 0 0
8.2548 4.7608 21.1375 C 0 0 0 0 0 0
7.1479 3.7322 21.1132 C 0 0 0 0 0 0
7.7728 2.5366 21.8185 C 0 0 0 0 0 0
8.9539 4.4605 22.4534 C 0 0 0 0 0 0
13.8873 0.1824 21.9500 C 0 0 0 0 0 0
8.5117 1.6060 20.8656 C 0 0 0 0 0 0
12.2544 6.2009 22.3970 N 0 0 0 0 0 0
10.3635 4.7178 22.4055 N 0 0 0 0 0 0
14.4254 5.4429 22.2718 N 0 0 0 0 0 0
13.7646 -0.5167 20.6443 N 0 3 0 0 0 0
6.5529 -4.6019 19.9460 O 0 5 0 0 0 0
8.2203 -4.0310 21.8048 O 0 5 0 0 0 0
6.8149 1.6459 17.3793 O 0 5 0 0 0 0
5.4231 -2.1179 18.5726 O 0 5 0 0 0 0
10.1403 7.0090 22.5243 O 0 0 0 0 0 0
5.7155 -3.6365 22.1679 O 0 0 0 0 0 0
5.6431 1.8811 19.7228 O 0 0 0 0 0 0
5.0295 -0.6218 20.7059 O 0 0 0 0 0 0
8.7342 3.0736 22.7475 O 0 0 0 0 0 0
6.0324 4.2091 21.8626 O 0 0 0 0 0 0
8.1857 1.9631 19.5323 O 0 0 0 0 0 0
7.0232 -2.2197 20.5667 O 0 0 0 0 0 0
7.0081 -0.1966 19.1450 O 0 0 0 0 0 0
6.8632 -3.7464 21.1697 P 0 0 0 0 0 0
6.7991 1.4009 18.8725 P 0 0 0 0 0 0
5.9605 -1.3044 19.7288 P 0 0 0 0 0 0
15.0444 4.6730 22.2089 H 0 0 0 0 0 0
14.7148 6.3890 22.3078 H 0 0 0 0 0 0
14.3405 -1.3642 20.6292 H 0 0 0 0 0 0
14.0706 0.0896 19.8769 H 0 0 0 0 0 0
12.7928 -0.7891 20.4667 H 0 0 0 0 0 0
5.3352 3.5319 21.8055 H 0 0 0 0 0 0
1 2 2 0 0 0
1 14 1 0 0 0
2 3 1 0 0 0
2 5 1 0 0 0
3 13 2 0 0 0
3 15 1 0 0 0
4 13 1 0 0 0
4 14 1 0 0 0
4 21 2 0 0 0
5 6 2 0 0 0
6 11 1 0 0 0
7 8 1 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 26 1 0 0 0
9 12 1 0 0 0
9 25 1 0 0 0
10 14 1 0 0 0
10 25 1 0 0 0
11 16 1 0 0 0
12 27 1 0 0 0
17 30 1 0 0 0
18 30 1 0 0 0
19 31 1 0 0 0
20 32 1 0 0 0
22 30 2 0 0 0
23 31 2 0 0 0
24 32 2 0 0 0
27 31 1 0 0 0
28 30 1 0 0 0
28 32 1 0 0 0
29 31 1 0 0 0
29 32 1 0 0 0
15 33 1 0 0 0
15 34 1 0 0 0
16 35 1 0 0 0
16 36 1 0 0 0
16 37 1 0 0 0
26 38 1 0 0 0
M END
> <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579
> <Name>
ZINC000169748276
> <RI>
1.76083e+07
> <Rbt.Executable>
rbdock/0.1.0
> <Rbt.Library>
librxdock.so/0.1.0
> <SCORE>
-41.7582
> <SCORE.INTER>
-41.8551
> <SCORE.INTER.CONST>
1
> <SCORE.INTER.POLAR>
-4.96496
> <SCORE.INTER.REPUL>
0
> <SCORE.INTER.ROT>
10
> <SCORE.INTER.VDW>
-40.3742
> <SCORE.INTER.norm>
-1.30797
> <SCORE.INTRA>
0.0969082
> <SCORE.INTRA.DIHEDRAL>
-5.79141
> <SCORE.INTRA.DIHEDRAL.0>
19.5819
> <SCORE.INTRA.POLAR>
0
> <SCORE.INTRA.POLAR.0>
0
> <SCORE.INTRA.REPUL>
0
> <SCORE.INTRA.REPUL.0>
0
> <SCORE.INTRA.VDW>
2.99261
> <SCORE.INTRA.VDW.0>
-5.2787
> <SCORE.INTRA.norm>
0.00302838
> <SCORE.RESTR>
0
> <SCORE.RESTR.CAVITY>
0
> <SCORE.RESTR.norm>
0
> <SCORE.SYSTEM>
0
> <SCORE.SYSTEM.CONST>
0
> <SCORE.SYSTEM.DIHEDRAL>
0
> <SCORE.SYSTEM.norm>
0
> <SCORE.heavy>
32
> <SCORE.norm>
-1.30494
$$$$
The .tsv file should look like this:
ZINC000169748276 -41.8551
ZINC000079214514 -41.7892
ZINC000195993528 -40.9293
CodePudding user response:
Using any awk:
$ awk -v OFS='\t' '
/^>/ { tag=$2; next }
NF { f[tag]=$1 }
$0 == "$$$$" { print f["<Name>"], f["<SCORE.INTER>"] }
' file
ZINC000169748276 -41.8551
The above assumes a line containing $$$$ is what's used to separate your input records.
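To get the .tsv file the question asks for, the output of this kind of script can be redirected to a file. The following is a minimal sketch, not the answer's exact script: sample.sdf is a hypothetical, heavily shortened stand-in for the real file, scores.tsv is an assumed output name, and two things are added as assumptions about what is wanted. The $$$$ rule is placed before the NF rule so the separator itself is never stored as a value, and split("", f) empties the array so a record with a missing tag cannot inherit the previous record's value.

```shell
# Hypothetical, shortened sample with two records (real files contain the full mol block).
cat > sample.sdf <<'EOF'
ZINC000169748276
M  END
>  <Name>
ZINC000169748276

>  <SCORE.INTER>
-41.8551

$$$$
ZINC000079214514
M  END
>  <Name>
ZINC000079214514

>  <SCORE.INTER>
-41.7892

$$$$
EOF

awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    $0 == "$$$$" {
        print f["<Name>"], f["<SCORE.INTER>"]
        split("", f)   # portable way to empty the array before the next record
        next
    }
    NF { f[tag]=$1 }
' sample.sdf > scores.tsv

cat scores.tsv
```

Each output line holds the name and the SCORE.INTER value separated by a TAB.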
Note that with this approach of first creating an array (f[] above) that maps the tags/names to their values, you can print whatever values you like in whatever order you like, convert the whole thing to a CSV, compare values with other values by their names, etc. For example, you can write things like this to analyze areas of your data and output reports:
awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    # Handle the record separator first so that $$$$ is not stored as the value of the last tag
    $0 == "$$$$" {
        if ( (f["<SCORE.INTRA.POLAR>"] >= f["<SCORE.INTRA.REPUL>"]) &&
             (f["<SCORE.RESTR.CAVITY>"] == 27) ) {
            print f["<Name>"]
            for ( tag in f ) {
                if ( tag ~ /SCORE/ ) {
                    print f[tag]
                }
            }
        }
        next
    }
    NF { f[tag]=$1 }
' file
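The "convert the whole thing to a CSV" idea mentioned above could look roughly like the sketch below. It is an illustration under assumptions: sample.sdf is a hypothetical one-record file using values from the question's block, the tags variable is an invented way to pick the columns and their order, and the angle brackets are stripped from the tag names for readability. Values that themselves contain commas (such as CHROM.1) would need quoting, which this sketch does not attempt.

```shell
# Hypothetical one-record sample built from values in the question.
cat > sample.sdf <<'EOF'
ZINC000169748276
M  END
>  <Name>
ZINC000169748276

>  <SCORE>
-41.7582

>  <SCORE.INTER>
-41.8551

$$$$
EOF

csv=$(awk -v OFS=',' -v tags='Name,SCORE,SCORE.INTER' '
    BEGIN {
        n = split(tags, want, ",")   # requested columns, in output order
        print tags                   # header row
    }
    /^>/ { tag = $2; gsub(/[<>]/, "", tag); next }
    $0 == "$$$$" {
        row = ""
        for (i = 1; i <= n; i++)
            row = row (i > 1 ? OFS : "") f[want[i]]
        print row
        split("", f)                 # start the next record with an empty array
        next
    }
    NF { f[tag] = $1 }
' sample.sdf)

printf '%s\n' "$csv"
```

Using an explicit column list avoids relying on the order of `for (tag in f)`, which awk leaves unspecified.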
If you're ever considering using getline then please see http://awk.freeshell.org/AllAboutGetline for why it's usually the wrong approach.
CodePudding user response:
Why awk?
Prompt> grep -A 1 -i "<NAME>" test.txt | tail -n 1
ZINC000169748276
Prompt> grep -A 1 -i "<SCORE.INTER>" test.txt | tail -n 1
-41.8551
As you see, grep is far easier.
-A 1 means "also take the next 1 line(s)".
After some discussion, this is the final solution:
grep -A 1 -i "<SCORE.INTER>" test.sdf | grep -v '^>' | grep -v '^--' >> results
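The command above collects only the score values. One possible way to get both columns with grep is sketched below, assuming every record contains both tags exactly once (otherwise the two lists would no longer line up). The file names sample.sdf, names.txt, scores.txt and results.tsv are hypothetical, and -A is assumed to be available as it is in GNU and BSD grep.

```shell
# Hypothetical, shortened sample with two records.
cat > sample.sdf <<'EOF'
ZINC000169748276
M  END
>  <Name>
ZINC000169748276

>  <SCORE.INTER>
-41.8551

$$$$
ZINC000079214514
M  END
>  <Name>
ZINC000079214514

>  <SCORE.INTER>
-41.7892

$$$$
EOF

# Take each tag line plus the line after it, then drop the tag lines and the
# "--" group separators that grep -A prints between non-adjacent matches.
grep -A 1 '^> *<Name>' sample.sdf | grep -v -e '^>' -e '^--' > names.txt
grep -A 1 '^> *<SCORE\.INTER>' sample.sdf | grep -v -e '^>' -e '^--' > scores.txt

# paste joins the two lists line by line, separated by a TAB.
paste names.txt scores.txt > results.tsv
cat results.tsv
```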
CodePudding user response:
I want to save the > <NAME> and > <SCORE.INTER> values in a .tsv file. Is there any way for a quick solution e.g. via awk?
Your file has > <Name>, not > <NAME> (an important difference if you match case-sensitively). I would use GNU AWK for this task in the following way (this assumes that > <Name> comes before > <SCORE.INTER> in each record and that each > <SCORE.INTER> has a corresponding > <Name>). Let file.txt contain the sample block shown in the question, then
awk '/^> <Name>/{getline;printf "%s\t",$0}/^> <SCORE\.INTER>/{getline;print $0}' file.txt
output
ZINC000169748276 -41.8551
Explanation: getline causes GNU AWK to load the next line, so $0 becomes the content of the line after the current one. When > <Name> is encountered at the start of a line (^), the next line is loaded and printed followed by a TAB; for a line starting with > <SCORE.INTER>, the next line is loaded and printed. Note that . needs to be escaped because it has a special meaning in regular expressions.
(tested in gawk 4.2.1)
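A slightly more defensive variant of the same getline idea is sketched below. It is not the answer's exact command: it reads the following line into a separate variable (called line here, a name chosen for illustration) and checks getline's return value, so a tag on the very last line of a truncated file does not print stale data. The sample.sdf file is a hypothetical, shortened stand-in for the real input.

```shell
# Hypothetical, shortened sample with two records.
cat > sample.sdf <<'EOF'
ZINC000169748276
M  END
>  <Name>
ZINC000169748276

>  <SCORE.INTER>
-41.8551

$$$$
ZINC000079214514
M  END
>  <Name>
ZINC000079214514

>  <SCORE.INTER>
-41.7892

$$$$
EOF

out=$(awk '
    /^> *<Name>/ {
        # getline returns 1 on success, 0 at end of input, -1 on error
        if ((getline line) > 0) name = line
        next
    }
    /^> *<SCORE\.INTER>/ {
        if ((getline line) > 0) print name "\t" line
    }
' sample.sdf)

printf '%s\n' "$out"
```

Because `getline line` fills a variable instead of $0, the current record is left untouched, which makes the control flow a little easier to follow.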