I am working with a compound database in SDF format. I would like to simply replace the title line of every molecule (the line right after the $$$$ pattern) with the text that follows > <GENERIC_NAME>.
The file looks like this:
$$$$
91443
-OEChem-10051719083D
55 57 0 1 0 0 0 0 0999 V2000
-5.0661 -1.1129 2.4181 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2383 1.9583 1.7563 O 0 0 0 0 0 0 0 0 0 0 0 0
7.3280 0.6119 -1.9919 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1868 0.6987 -2.7387 O 0 0 0 0 0 0 0 0 0 0 0 0
<GENERIC_NAME> Tetrahydrofolic acid
The script should replace 91443 by Tetrahydrofolic acid.
Thanks in advance. Best Regards.
I tried the simple approach of grepping both patterns and replacing with sed, with no result:
a=$(grep -A 1 --no-group-separator "$$$$" test.sdf | grep -v "$$$$")
b=$(grep -A 1 --no-group-separator "GENERIC_NAME" test.sdf | grep -v "GENERIC_NAME")
while $a $b,
do
sed -i "s/$a/$b/" test.sdf
done
CodePudding user response:
After correcting some syntax errors, the grep/sed combination will only work when you have exactly one match in the file. You want to replace all of them. You can use awk, which is great for logic that spans several lines.
With the GNU sed option -z you can try some magic, but more information about the input would help.
When you are sure that the only > character in a record is the one in <GENERIC_NAME>, you can use
sed -rz 's/(\$\$\$\$\n)[^\n]*\n([^>]*)> ([^\n]*)/\1\3\n\2> \3/g' test.sdf
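As a quick check, the command above can be run on a small two-record sample. This is only a sketch: it assumes GNU sed, and it assumes the name sits on the same line as <GENERIC_NAME>, as in the excerpt in the question. The file names and the second record are made up for illustration.

```shell
# Two-record sample; the second record is invented for this demo.
cat > sample.sdf <<'EOF'
$$$$
91443
-OEChem-10051719083D
<GENERIC_NAME> Tetrahydrofolic acid
$$$$
2244
-OEChem-10051719083D
<GENERIC_NAME> Aspirin
EOF

# GNU sed only: -z treats the whole file as one string, -r enables extended regex.
# Group 1 is "$$$$\n", group 2 is everything up to the first ">", group 3 is the name.
sed -rz 's/(\$\$\$\$\n)[^\n]*\n([^>]*)> ([^\n]*)/\1\3\n\2> \3/g' sample.sdf > sample_out.sdf

cat sample_out.sdf
```

Because of the g flag, the title line of every record is rewritten, not just the first one.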
However, when you can have more > characters (or <GENERIC_NAME> is in fact a different string), you might need something like
sed -rz 's/<GENERIC_NAME>/\r/g;
s/(\$\$\$\$\n)[^\n]*\n([^\r]*)\r ([^\n]*)/\1\3\n\2<GENERIC_NAME> \3/g' test.sdf
The sed command gets uglier with each iteration (each fix for another requirement), so it is a good idea to look at awk. Something like
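awk '
# Record separator: flush a record that had no <GENERIC_NAME>, then expect the title line.
/^\$\$\$\$$/ { if (getbody) printf("%s\n%s", header, body); print; getheader=1; getbody=0; body=""; next }
# Hold the old title back instead of printing it.
getheader { header=$0; getheader=0; getbody=1; next }
# Print the name as the new title, then the buffered body and the tag line itself.
/<GENERIC_NAME>/ { name=$0; sub(/.*<GENERIC_NAME>[ \t]*/, "", name); printf("%s\n%s%s\n", name, body, $0); getbody=0; body=""; next }
getbody { body = body $0 "\n"; next }
{ print }
END { if (getbody) printf("%s\n%s", header, body) }
' test.sdf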
PS: Please edit your question (formatting, example output, and whether there can be any lines outside the $$$$-<GENERIC_NAME> blocks).
CodePudding user response:
If I understand your question, you want to "replace 91443 by Tetrahydrofolic acid", meaning remove 91443 which follows $$$$ and replace it with "Tetrahydrofolic acid". If that is the case, then a simple awk expression will do: it sets a flag when "$$$$" is encountered as a line and replaces the next line with "Tetrahydrofolic acid", e.g.
awk '
replace { $0 = "Tetrahydrofolic acid"; replace = 0 }
/^\$\$\$\$$/ { replace = 1 }
1' compound.sdf
(the 1 is simply shorthand for the default operation print)
The variable replace is simply set to 1 when the next line should be replaced, and back to 0 after the replacement until the next record of "$$$$" is encountered.
Example Use/Output
With your example input in the file compound.sdf, you would get:
$ awk '
> replace { $0 = "Tetrahydrofolic acid"; replace = 0 }
> /^\$\$\$\$$/ { replace = 1 }
> 1' compound.sdf
$$$$
Tetrahydrofolic acid
-OEChem-10051719083D
55 57 0 1 0 0 0 0 0999 V2000
-5.0661 -1.1129 2.4181 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2383 1.9583 1.7563 O 0 0 0 0 0 0 0 0 0 0 0 0
7.3280 0.6119 -1.9919 O 0 0 0 0 0 0 0 0 0 0 0 0
5.1868 0.6987 -2.7387 O 0 0 0 0 0 0 0 0 0 0 0 0
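If the file holds many molecules and each title should instead come from that record's own <GENERIC_NAME> line, the same flag idea can be extended by reading the file twice. The following is only a sketch under assumptions: the name is taken to sit on the same line as <GENERIC_NAME> (as in the excerpt in the question), and the file names and the second record are made up for illustration.

```shell
# Two-record sample; the second record is invented for this demo.
cat > compound_demo.sdf <<'EOF'
$$$$
91443
-OEChem-10051719083D
<GENERIC_NAME> Tetrahydrofolic acid
$$$$
2244
-OEChem-10051719083D
<GENERIC_NAME> Aspirin
EOF

# The file is given twice: pass 1 (NR == FNR) only collects names,
# pass 2 prints the file with each title line replaced.
awk '
  FNR == 1     { rec = 0 }       # restart the record counter at the start of each pass
  /^\$\$\$\$$/ { rec++ }         # every "$$$$" line opens a new record
  NR == FNR    {                 # pass 1: remember the name of the current record
      if (sub(/.*<GENERIC_NAME>[ \t]*/, "")) name[rec] = $0
      next
  }
  replace      { if (rec in name) $0 = name[rec]; replace = 0 }
  /^\$\$\$\$$/ { replace = 1 }   # the line after "$$$$" is the title
  1
' compound_demo.sdf compound_demo.sdf > compound_renamed.sdf

cat compound_renamed.sdf
```

Records without a <GENERIC_NAME> line keep their original title, because the replacement only happens when a name was collected for that record.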
Let me know if I misunderstood the question and I'm happy to help further.