Bash: replace the line after one pattern with text from another line that contains a second pattern

I am working with a compound database in SDF format. I would like to simply replace the title line of every molecule (the line that follows the $$$$ pattern) with the name given for > <GENERIC_NAME>.

The file looks like this:

$$$$
91443
  -OEChem-10051719083D

 55 57  0     1  0  0  0  0  0999 V2000
   -5.0661   -1.1129    2.4181 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2383    1.9583    1.7563 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.3280    0.6119   -1.9919 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1868    0.6987   -2.7387 O   0  0  0  0  0  0  0  0  0  0  0  0

<GENERIC_NAME> Tetrahydrofolic acid

The script should replace 91443 with Tetrahydrofolic acid.

Thanks in advance. Best Regards.

I tried a simple approach, grepping for both patterns and replacing with sed, without success:

a=$(grep -A 1  --no-group-separator "$$$$" test.sdf | grep -v "$$$$")
b=$(grep -A 1  --no-group-separator "GENERIC_NAME" test.sdf | grep -v "GENERIC_NAME")
while $a $b,
do 
    sed -i "s/$a/$b/" test.sdf
done

CodePudding user response：

After correcting some syntax errors, the grep/sed combination will only work when there is exactly one match in the file, but you want to replace all of them. awk is a better fit here, since it handles logic that spans several lines well.
With the GNU sed option -z you can try some magic, but additional information would be helpful.
When you are sure that the only > character in each record is the one in <GENERIC_NAME>, you can use

sed -rz 's/(\$\$\$\$\n)[^\n]*\n([^>]*)> ([^\n]*)/\1\3\n\2> \3/g' test.sdf
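To check what this command does on more than one record, here is a minimal sketch. It assumes GNU sed (needed for -z) and a hypothetical two-record file built only for this demonstration; the second record and the variable names demo and sed_out are made up, and the name is assumed to sit on the same line as the tag, as in the sample from the question.

```shell
# Hypothetical two-record sample, built only for this demonstration.
demo=$(mktemp)
printf '%s\n' \
  '$$$$' '91443' '  -OEChem-10051719083D' '' \
  '<GENERIC_NAME> Tetrahydrofolic acid' \
  '$$$$' '00002' '  -demo-header' '' \
  '<GENERIC_NAME> Example compound B' > "$demo"

# -z makes GNU sed treat the whole file as one record, so \n can be matched;
# -r enables extended regular expressions. The file itself is not modified.
sed_out=$(sed -rz 's/(\$\$\$\$\n)[^\n]*\n([^>]*)> ([^\n]*)/\1\3\n\2> \3/g' "$demo")

printf '%s\n' "$sed_out"
rm -f "$demo"
```

In this sketch the line after each $$$$ should end up holding that record's own name, while the <GENERIC_NAME> lines stay in place.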

However, when there can be more > characters (or <GENERIC_NAME> is in fact a different string), you might need something like

sed -rz 's/<GENERIC_NAME>/\r/g;
  s/(\$\$\$\$\n)[^\n]*\n([^\r]*)\r ([^\n]*)/\1\3\n\2<GENERIC_NAME> \3/g' test.sdf

The sed command gets uglier with each iteration (each fix for another requirement), so it is a good idea to look at awk. Something like

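awk '
  # Print a record that never reached a <GENERIC_NAME> line, keeping its old title.
  function flush() { if (getbody) printf("%s\n%s", header, body); getbody = 0; body = "" }
  /^\$\$\$\$$/ { flush(); print; getheader = 1; next }
  getheader { header = $0; getheader = 0; getbody = 1; next }
  getbody && /<GENERIC_NAME>/ {
      name = $0; sub(/^.*<GENERIC_NAME> */, "", name)   # text after the tag, assumed to be on the same line
      printf("%s\n%s%s\n", name, body, $0); getbody = 0; body = ""; next
  }
  getbody { body = body $0 "\n"; next }
  { print }
  END { flush() }
  ' test.sdf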

PS: Please edit your question (formatting, example output, and whether there can be any lines outside the $$$$-<GENERIC_NAME> blocks).

CodePudding user response：

If I understand your question, you want to "replace 91443 by Tetrahydrofolic acid", meaning remove the 91443 that follows $$$$ and put "Tetrahydrofolic acid" in its place. If that is the case, a simple awk script will do: it sets a flag when a line consisting of "$$$$" is encountered and replaces the next line with "Tetrahydrofolic acid", e.g.

awk '
  replace { $0 = "Tetrahydrofolic acid"; replace = 0 } 
  /^\$\$\$\$$/ { replace = 1 }
1' compound.sdf

(the 1 is simply shorthand for the default operation print)

The variable replace is simply set to 1 when the next line should be replaced, and back to 0 after the replacement until the next record of "$$$$" is encountered.

Example Use/Output

With your example input in the file compound.sdf, you would get:

$ awk '
>   replace { $0 = "Tetrahydrofolic acid"; replace = 0 }
>   /^\$\$\$\$$/ { replace = 1 }
> 1' compound.sdf
$$$$
Tetrahydrofolic acid
  -OEChem-10051719083D

 55 57  0     1  0  0  0  0  0999 V2000
   -5.0661   -1.1129    2.4181 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2383    1.9583    1.7563 O   0  0  0  0  0  0  0  0  0  0  0  0
    7.3280    0.6119   -1.9919 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1868    0.6987   -2.7387 O   0  0  0  0  0  0  0  0  0  0  0  0
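
If the file holds more than one molecule, the hard-coded name would have to come from each record's own <GENERIC_NAME> line instead. A minimal sketch of one way to do that, reading the file twice with awk: it assumes the name sits on the same line as the tag, as in your sample, and it uses a hypothetical two-record file built only for this demonstration (the second record and the names tmp, result, rec, cur and names are made up).

```shell
# Hypothetical two-record sample, built only for this demonstration.
tmp=$(mktemp)
printf '%s\n' \
  '$$$$' '91443' '  -OEChem-10051719083D' '' \
  '<GENERIC_NAME> Tetrahydrofolic acid' \
  '$$$$' '00002' '  -demo-header' '' \
  '<GENERIC_NAME> Example compound B' > "$tmp"

# Pass 1 (NR == FNR): remember each record's name, keyed by record number.
# Pass 2: replace the line after each $$$$ with the stored name, if there is one.
result=$(awk '
  NR == FNR {
      if (/^\$\$\$\$$/) rec++
      else if (/<GENERIC_NAME>/) {
          name = $0
          sub(/^.*<GENERIC_NAME> */, "", name)   # keep only the text after the tag
          names[rec] = name
      }
      next
  }
  replace { if (cur in names) $0 = names[cur]; replace = 0 }
  /^\$\$\$\$$/ { cur++; replace = 1 }
1' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The file is passed twice on purpose: the name appears after the title it has to replace, so the first pass collects the names and the second pass does the printing. A record without a <GENERIC_NAME> line simply keeps its old title.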

Let me know if I misunderstood the question and I'm happy to help further.