I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.
My first file, idnumber.txt, looks like this:
4323-7584
K8933-4943
L2837-0493
The file I am attempting to filter the molecule blocks from has entries as follows:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.6046 1.2567 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4711 0.7499 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4711 -0.2535 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.8766 1.2667 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.8717 -1.7128 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.6046 -0.7552 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.8614 -0.7045 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.6046 -1.7586 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.7382 -0.2383 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.0050 0.7802 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.6097 2.2601 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.0050 -0.2079 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0102 -2.2245 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.7431 -2.2601 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.7433 -2.2094 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-4.3427 1.2515 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.3427 -0.7552 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.7281 -0.1928 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.7129 0.7954 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.4711 -2.2449 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.8565 1.2819 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6097 -1.7077 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.4762 -2.2094 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.3429 -1.7077 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2093 0.7499 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2093 -0.2535 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.2093 -2.2094 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0
3 2 1 0 0 0 0
4 3 1 0 0 0 0
5 1 1 0 0 0 0
6 8 1 0 0 0 0
7 4 1 0 0 0 0
8 13 2 0 0 0 0
9 7 1 0 0 0 0
10 1 2 0 0 0 0
11 5 1 0 0 0 0
12 2 2 0 0 0 0
13 11 1 0 0 0 0
14 6 2 0 0 0 0
15 9 2 0 0 0 0
16 6 1 0 0 0 0
17 3 2 0 0 0 0
18 4 2 0 0 0 0
19 20 2 0 0 0 0
20 22 1 0 0 0 0
21 9 1 0 0 0 0
22 11 2 0 0 0 0
23 16 1 0 0 0 0
24 23 1 0 0 0 0
25 24 1 0 0 0 0
26 17 1 0 0 0 0
27 26 2 0 0 0 0
28 25 1 0 0 0 0
19 8 1 0 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
Each entry is a different length, and I am trying to extract the entire block from the -ISIS- -- StrEd --
I have tried a few sed options, trying to match from the first line of a block to its IDNUMBER line and extract everything in between, but that didn't work. My current iteration of the code is as follows:
#!/bin/bash
cat idnumbers.txt | while read line
do
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done
But I have a feeling I am missing something, as it is not actually writing anything to filtered.sdf.
CodePudding user response:
Maybe this is what you are looking for
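#!/usr/bin/bash
while read -r IDNUMBER; do
    awk -v id="$IDNUMBER" '
        # Header line: a new molecule block starts, so reset the buffer
        /^[[:space:]]*-ISIS-[[:space:]]*--[[:space:]]*StrEd[[:space:]]*--/ {
            text = ""
            found = 1
        }
        # Collect every line of the current block
        found { text = text $0 "\n" }
        # The tag line of this block names the wanted ID
        /^>[[:space:]]*<IDNUMBER>/ && index($0, "(" id ")") { found = 2 }
        # End of block: print it if it matched, then stop reading
        /^\$\$\$\$/ && found == 2 { printf "%s", text; exit }
    ' compound_library.sdf > "${IDNUMBER}.sdf"
done < idnumber.txt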
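If a single filtered.sdf is wanted rather than one file per ID, the same idea can be sketched as one pass over the library, with the ID list loaded into an awk array first. This is only a sketch under assumptions: the sample data and the temporary directory are made up so the example runs on its own, each block is assumed to start at the -ISIS- header and end at a $$$$ line, and idnumber.txt is assumed to be non-empty (the NR == FNR test misfires on an empty first file).

```shell
#!/usr/bin/env bash
# Sketch: one pass over the library, keeping every block whose ID is listed.
# The sample data below is invented for illustration only.
set -euo pipefail
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' 'K784-9550' 'L2837-0493' > idnumber.txt

cat > compound_library.sdf <<'EOF'
-ISIS- -- StrEd --
atom and bond lines for the first molecule
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
-ISIS- -- StrEd --
atom and bond lines for the second molecule
M END
> <IDNUMBER> (X000-0000)
X000-0000
$$$$
EOF

awk '
    # First file: remember every wanted ID (skip blank lines)
    NR == FNR { if ($1 != "") wanted[$1]; next }
    # Second file: buffer the current block line by line
    { block = block $0 "\n" }
    # Tag line: pull the ID out of the parentheses and look it up
    /^>[[:space:]]*<IDNUMBER>/ {
        id = $0
        sub(/^[^(]*\(/, "", id)
        sub(/\).*$/, "", id)
        if (id in wanted) keep = 1
    }
    # A line starting with four dollar signs closes a block:
    # print the block if it is wanted, then reset for the next one
    /^\$\$\$\$/ {
        if (keep) printf "%s", block
        block = ""
        keep = 0
    }
' idnumber.txt compound_library.sdf > filtered.sdf

cat filtered.sdf
```

With the sample data, only the six lines of the K784-9550 block end up in filtered.sdf. Reading the library once avoids rescanning it for every ID, which may matter for a large compound library.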
CodePudding user response:
I am missing something
In this command
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
you are using the following regular expressions:
^-ISIS-$
^$line$
Here ^ denotes the start of a line and $ denotes the end of a line.
The first looks for -ISIS- spanning the whole line, whilst your file has
-ISIS- -- StrEd --
that is, -ISIS- as only part of the line. You should therefore use the regular expression without anchors, that is, -ISIS-
The second has a $ followed by further characters (line). Because the sed script is in single quotes, the shell never replaces $line with the ID, so this end address cannot match any line of your file. Once the range has started, sed will therefore keep printing until the end of the file. I have no idea whether this is the desired behaviour, but be warned that the more common way to do that in GNU sed is to use $ as the address (meaning the last line). For example, if you want to print the first line holding a digit and all following lines, you could do
sed -n '/[0-9]/,$p' file.txt
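The behaviour of the two addresses can be checked on a small file. The following is a minimal sketch, not a fix for the whole task: sample.sdf and its contents are made up for illustration, and each command only counts how many lines sed prints.

```shell
#!/usr/bin/env bash
# Sketch: how the addresses in the sed command behave.
# sample.sdf is an invented, shortened entry used only for this demonstration.
set -euo pipefail
workdir=$(mktemp -d)
cd "$workdir"

cat > sample.sdf <<'EOF'
-ISIS- -- StrEd --
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
EOF

line='K784-9550'

# Anchored on both sides: the header has more text after -ISIS-, so no match
anchored=$(sed -n '/^-ISIS-$/p' sample.sdf | wc -l)

# Unanchored: matches the header line
unanchored=$(sed -n '/-ISIS-/p' sample.sdf | wc -l)

# Single quotes: the shell leaves $line alone, so sed never sees the ID
single=$(sed -n '/^$line$/p' sample.sdf | wc -l)

# Double quotes: the shell substitutes the ID; \$ passes a plain $ on to sed
double=$(sed -n "/^${line}\$/p" sample.sdf | wc -l)

# A range whose end address never matches runs on to the last line
range=$(sed -n '/-ISIS-/,/^$line$/p' sample.sdf | wc -l)

# $(( )) strips any padding that wc may add around the number
echo "anchored=$((anchored)) unanchored=$((unanchored)) single=$((single)) double=$((double)) range=$((range))"
```

On this sample it prints anchored=0 unanchored=1 single=0 double=1 range=5. Note also that `> filtered.sdf` inside the loop truncates the file on every iteration, so redirecting once after `done` (or appending with `>>`) is needed to keep results from more than one ID.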