Extract compound data from SDF file using IDNUMBER and write to a new file

I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.

My first file, idnumber.txt, looks like this:

4323-7584
K8933-4943
L2837-0493

The file I am attempting to filter the molecule blocks from has entries as follows:

  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.6046    1.2567    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4711    0.7499    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4711   -0.2535    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8766    1.2667    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.8717   -1.7128    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.6046   -0.7552    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.8614   -0.7045    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.6046   -1.7586    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7382   -0.2383    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0050    0.7802    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.6097    2.2601    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0050   -0.2079    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0102   -2.2245    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7431   -2.2601    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.7433   -2.2094    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -4.3427    1.2515    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.3427   -0.7552    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7281   -0.1928    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7129    0.7954    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4711   -2.2449    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8565    1.2819    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6097   -1.7077    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4762   -2.2094    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.3429   -1.7077    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2093    0.7499    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2093   -0.2535    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2093   -2.2094    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  2  1  0  0  0  0
  4  3  1  0  0  0  0
  5  1  1  0  0  0  0
  6  8  1  0  0  0  0
  7  4  1  0  0  0  0
  8 13  2  0  0  0  0
  9  7  1  0  0  0  0
 10  1  2  0  0  0  0
 11  5  1  0  0  0  0
 12  2  2  0  0  0  0
 13 11  1  0  0  0  0
 14  6  2  0  0  0  0
 15  9  2  0  0  0  0
 16  6  1  0  0  0  0
 17  3  2  0  0  0  0
 18  4  2  0  0  0  0
 19 20  2  0  0  0  0
 20 22  1  0  0  0  0
 21  9  1  0  0  0  0
 22 11  2  0  0  0  0
 23 16  1  0  0  0  0
 24 23  1  0  0  0  0
 25 24  1  0  0  0  0
 26 17  1  0  0  0  0
 27 26  2  0  0  0  0
 28 25  1  0  0  0  0
 19  8  1  0  0  0  0
 18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (K784-9550)
K784-9550

$$$$

Each entry has a different length, and I am trying to extract the entire block, starting from the -ISIS- -- StrEd -- line.

I have tried a few sed variants that match from the first line of a block to the IDNUMBER and extract that range, but none of them worked. My current iteration of the code is as follows:

#!/bin/bash
cat idnumbers.txt | while read line
do
  sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done

But I have a feeling I am missing something as it is not actually writing anything to filtered.sdf.

CodePudding user response：

Maybe this is what you are looking for

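#!/usr/bin/env bash

# Write one output file per ID listed in idnumber.txt.
while IFS= read -r IDNUMBER; do
  awk -v id="$IDNUMBER" '
    # A header line starts a new molecule block: reset the buffer.
    /^[[:space:]]*-ISIS-[[:space:]]*--[[:space:]]*StrEd[[:space:]]*--/ {
      text = ""; collecting = 1; matched = 0
    }
    # Buffer every line of the current block.
    collecting { text = text $0 "\n" }
    # Mark the block when its IDNUMBER tag names the wanted ID.
    /^>[[:space:]]*<IDNUMBER>/ && index($0, "(" id ")") { matched = 1 }
    # "$$$$" ends the block: print it only if it matched.
    /^\$\$\$\$/ {
      if (matched) printf "%s", text
      collecting = 0
    }' compound_library.sdf > "${IDNUMBER}.sdf"
done < idnumber.txt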
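
If a single output file is preferred over one file per ID, the ID list can also be loaded once and the library read in a single pass. The following is only a minimal sketch: the function name extract_by_id is hypothetical, and it assumes that every record ends with a $$$$ line and carries a tag line of the form ">  <IDNUMBER> (ID)" as in the sample from the question. It keeps each matching record whole, including any line that precedes the -ISIS- header.

```shell
#!/usr/bin/env bash

# extract_by_id IDFILE SDFFILE  (hypothetical helper name)
# Prints every SDF record whose <IDNUMBER> tag names an ID listed in IDFILE.
extract_by_id() {
  awk '
    # First file: remember each non-empty ID (trailing CR stripped).
    NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
    # Second file: buffer the lines of the current record.
    { block = block $0 "\n" }
    # Tag line such as ">  <IDNUMBER> (K784-9550)": take the text in parentheses.
    /^>[[:space:]]*<IDNUMBER>/ {
      id = $0
      sub(/^[^(]*\(/, "", id)
      sub(/\).*$/, "", id)
      if (id in wanted) keep = 1
    }
    # "$$$$" closes the record: print it if wanted, then reset.
    /^\$\$\$\$/ {
      if (keep) printf "%s", block
      block = ""; keep = 0
    }
  ' "$1" "$2"
}

# Usage: extract_by_id idnumber.txt compound_library.sdf > filtered.sdf
```

Because the IDs are compared as plain strings via an array lookup, characters in an ID are never interpreted as regular-expression syntax.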

CodePudding user response：

Regarding "I am missing something":

In this command

sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf

you are using the following regular expressions:

^-ISIS-$
^$line$

^ denotes the start of a line and $ denotes the end of a line.

The first looks for -ISIS- spanning the whole line, whilst your file has

  -ISIS-  -- StrEd --

that is, -ISIS- is only part of the line. You should therefore use the regular expression without anchors, that is, -ISIS-
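
The effect of the anchors can be checked directly against the header line. This is a small sketch using grep, which handles ^ and $ the same way; the variable names are chosen only for illustration.

```shell
# The header line as it appears in the SDF file (leading and trailing spaces).
header='  -ISIS-  -- StrEd -- '

# Anchored pattern: requires the whole line to be exactly "-ISIS-".
anchored=$(printf '%s\n' "$header" | grep -c -e '^-ISIS-$' || true)

# Unanchored pattern: matches "-ISIS-" anywhere in the line.
unanchored=$(printf '%s\n' "$header" | grep -c -e '-ISIS-' || true)

echo "anchored: $anchored, unanchored: $unanchored"
# prints "anchored: 0, unanchored: 1"
```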

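The second contains $ followed by other characters (line). Because the sed script is written in single quotes, the shell does not expand $line, so sed receives the text $line as is rather than one of your IDs, and the end address can never match an ID line. Once the start address is fixed so that it matches, sed will therefore keep printing until the end of the file.

I have no idea whether that is the desired behaviour, but be warned that the more common way to do this in GNU sed is to use $ as the address (meaning the last line). For example, to print the first line containing a digit and all lines following it, you could do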

sed -n '/[0-9]/,$p' file.txt
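
Both points can be tried on a throwaway file. In this sketch, the file contents and the variable names (sample, id, from_digit, up_to_id) are made up for illustration. The first sed command uses $ as the last-line address; the second puts the script in double quotes so that the shell expands ${id}, with \$ keeping the end-of-line anchor.

```shell
# Small sample file, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' 'header' 'abc' 'line 1' 'K784-9550' 'tail' > "$sample"

# From the first line containing a digit through the last line ($).
from_digit=$(sed -n '/[0-9]/,$p' "$sample")

# Double quotes let the shell expand ${id}; \$ stays a literal $ for sed.
id='K784-9550'
up_to_id=$(sed -n "/abc/,/^${id}\$/p" "$sample")

printf '%s\n---\n%s\n' "$from_digit" "$up_to_id"
rm -f "$sample"
```

The first range prints "line 1", "K784-9550" and "tail"; the second prints "abc", "line 1" and "K784-9550".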