I found this code for parsing an SDF file, but it does not handle the whitespace inside tag names, which is why the Ki (nM) value does not show up in the output.
My file looks like this:
> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N
> <BindingDB MonomerID>
50417287
> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE
> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A
> <Target Source Organism According to Curator or DataSource>
Homo sapiens
> <Ki (nM)>
0.0316
> <IC50 (nM)>
> <Kd (nM)>
> <EC50 (nM)>
---------------------------
awk -v OFS='\t' '
/^>/ { tag=$2; next }
NF { f[tag]=$1 }
$0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf
Thank you!
CodePudding user response:
Please try the match() function to extract the tag between < and >, inclusive:
awk -v OFS='\t' '
/^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
NF { f[tag]=$1 }
$0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf
- The function match($0, /<.+>/) returns a non-zero value if the regex <.+> matches $0, and sets the awk variables RSTART and RLENGTH to the start position and the length of the matched substring.
- The regex <.+> matches a substring that starts with < and ends with >. The substring may contain whitespace characters.
- substr($0, RSTART, RLENGTH) returns the substring of $0 that starts at RSTART and is RLENGTH characters long. That substring is then assigned to the variable tag.
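To see why the field-based version loses the tag, it may help to run both extractions on a single tag line. The following is only a minimal sketch: the sample line is taken from the question, and show_tags is a hypothetical helper name introduced for this demo, not part of the original script.

```shell
# Compare field-based and match()-based tag extraction on one tag line.
# show_tags is a hypothetical helper used only for this illustration.
show_tags() {
    printf '%s\n' "$1" | awk '
    /^>/ {
        by_field = $2                      # whitespace splitting keeps only the first word of the tag
        by_match = ""
        if (match($0, /<.+>/))             # on success, sets RSTART and RLENGTH
            by_match = substr($0, RSTART, RLENGTH)
        print by_field
        print by_match
    }'
}

show_tags '> <Ki (nM)>'
```

With the default field separator, $2 is only <Ki, so f["<Ki (nM)>"] is never set in the original script. The match() version prints the whole tag, <Ki (nM)>, which is the key the print statement looks up.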