Remove data between two words with sed

Assume we have a file with this content:

<tag1>
junk1
junk2
</tag1>
data1
data2
data3
<tag1>junk3</tag1>
data4
data5

We want to remove all data between two strings, here <tag1> and </tag1>. I can do the job with a sed command like:

cat input | sed '/<tag1>/,/<\/tag1>/d'

But there is a problem: the command doesn't work properly, and the data after the one-line <tag1>...</tag1> element is removed from the output as well. The output of the above command:

data1
data2
data3

So the main question is: how can we remove the data between two strings/tags/patterns, whether they sit on one line or span multiple lines?

thanks

CodePudding user response：

sed cannot parse XML, so there are many cases where it does not work well (e.g. tags inside a comment).
sed regexes also do not support non-greedy matching, so we need to think about workarounds.

Based on the above, would you please try:

sed $'s/<tag1>/&\\\n/g' input | sed '/<tag1>/,/<\/tag1>/d'

Output:

data1
data2
data3
data4
data5

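The first sed just puts a line break after each <tag1>, so the closing </tag1> always ends up on a later line than the opening tag. That matters because sed only starts looking for the end of a range on the line after the one that started it.
Although it works for the provided example, please note that there are many cases where it doesn't work well (e.g. when </tag1> is missing). The $'...' quoting also needs a shell that supports it, such as bash.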

CodePudding user response：

Removing the single-line matches before applying the range may help. A range keeps matching until the end of the file if the end pattern is not found after the line that started it, which is what happens with your single-line match.

$ sed '/>[a-z0-9]*</d;/</,/>/d' input_file
data1
data2
data3
data4
data5

/>[a-z0-9]*</d - Here, the single-line element is matched and deleted first. It could be targeted more precisely if needed, but matching on the > and < brackets will suffice in this case.

/</,/>/d - Now your original approach is applied. As there is only one range left to match, it removes that range and keeps everything else. Again, it could be made more precise with tag1, but this will suffice in this instance.
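For reference, a more targeted form of the same idea could look like the sketch below, which names tag1 in both commands. It assumes that a one-line <tag1>...</tag1> never shares its line with data you want to keep, because the whole line is deleted. The temporary file and the result variable are illustrative only.

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
<tag1>
junk1
junk2
</tag1>
data1
data2
data3
<tag1>junk3</tag1>
data4
data5
EOF

# 1st command: delete lines that hold a complete <tag1>...</tag1> pair.
# 2nd command: delete the remaining multi-line <tag1> ... </tag1> ranges.
result=$(sed '/<tag1>.*<\/tag1>/d;/<tag1>/,/<\/tag1>/d' "$input")
printf '%s\n' "$result"

rm -f "$input"
```

Because d ends the current cycle, a line removed by the first command never reaches the range command, so it cannot open a range that runs to the end of the file.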

CodePudding user response：

Note: cat input | sed SCRIPT is a useless use of cat; simply run sed SCRIPT input. Let's assume that:

you use GNU sed,
you may have other tags (e.g., <tag2>),
you may have several groups on the same line ( a<tag1>b</tag1>c<tag1>d</tag1>e),
you don't have nested groups (<tag1>a<tag1>b</tag1>c</tag1>),
all your <tag1> and </tag1> are properly balanced.

GNU sed has the neat -z option that treats the NUL character as the line terminator, instead of the newline character. So, as your input file does not contain any NUL character, this lets sed treat its whole content as one single string (with newline characters in it).

We can thus start deleting the <tag1>...</tag1> groups without considering whether they are on the same "line" or not. But as sed regexes are greedy, we cannot simply use s#<tag1>.*</tag1>##g, as it would remove everything between the first <tag1> and the last </tag1>: if you have more than one group it would also remove the text between groups.
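To make the greedy behaviour concrete, here is a small sketch that runs the naive substitution on the sample input from the question. It assumes GNU sed (for -z); the temporary file and the result variable are illustrative only.

```shell
# Recreate the sample input from the question in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
<tag1>
junk1
junk2
</tag1>
data1
data2
data3
<tag1>junk3</tag1>
data4
data5
EOF

# With -z the whole file is one "line". The greedy ".*" runs from the
# first <tag1> to the LAST </tag1>, so data1..data3 are lost as well.
result=$(sed -z 's#<tag1>.*</tag1>##g' "$input")
printf '%s\n' "$result"

rm -f "$input"
```

Only an empty line followed by data4 and data5 survives, which is the over-deletion described above.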

We can however loop over two substitute commands: one that removes the empty groups <tag1></tag1>, followed by one that removes any single character after <tag1>, and repeat as long as single characters are removed:

$ cat input
<tag1>
junk1
junk2
</tag1>
data1
data2<tag1>junk3</tag1>data3<tag1>junk4</tag1>data4
data5
<tag1>junk5</tag1>
data6
<tag1>junk6</tag1>
$ sed -Ez ':a;s#<tag1></tag1>##g;s#(<tag1>).#\1#g;ta' input

data1
data2data3data4
data5

data6

Explanation: :a is a label, used for looping. s#<tag1></tag1>##g removes all empty groups. s#(<tag1>).#\1#g removes any single character after <tag1>. ta branches to label a if any substitution has succeeded since the last t (or since the input was read). In other words we loop until there are no substitutions; in each iteration we remove all empty groups and remove one character between all non-empty <tag1>, </tag1> pairs. When we stop, all groups have been deleted.

If the empty lines it leaves should also be removed, we just add one final command that deletes all empty "lines". It does that by replacing any run of whitespace (possibly empty) between two newline characters (or between the beginning of the pattern space and a newline character) with a single newline character (or nothing if it was at the beginning of the pattern space):

$ sed -Ez ':a;s#<tag1></tag1>##g;s#(<tag1>).#\1#g;ta;s#(\`|\n)\s*\n#\1#g' input
data1
data2data3data4
data5
data6
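If the same clean-up is needed for other tags, the command can be wrapped in a small shell function. The sketch below is only an illustration: strip_tag is a made-up name, it assumes GNU sed, the assumptions listed at the top of this answer, and a tag name free of regex metacharacters and of #. It uses ^ instead of \` only to avoid escaping a backtick inside double quotes; without the M flag both anchor at the start of the pattern space.

```shell
# strip_tag TAG FILE  (hypothetical helper, GNU sed only)
# Removes every <TAG>...</TAG> group, then the empty lines left behind.
strip_tag() {
    sed -Ez ":a;s#<$1></$1>##g;s#(<$1>).#\1#g;ta;s#(^|\n)\s*\n#\1#g" "$2"
}

# Recreate the extended sample input in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
<tag1>
junk1
junk2
</tag1>
data1
data2<tag1>junk3</tag1>data3<tag1>junk4</tag1>data4
data5
<tag1>junk5</tag1>
data6
<tag1>junk6</tag1>
EOF

result=$(strip_tag tag1 "$input")
printf '%s\n' "$result"

rm -f "$input"
```

Called as strip_tag tag2 file, the same function would target <tag2> groups instead.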

CodePudding user response：

It may not be what you need, but with an XPath tool like xmllint and valid XML input such as:

<root>
<tag1>
junk1
junk2
</tag1>
<!-- <tag1> -->
data1
data2
data3
<hello><tag1>junk3</tag1></hello>
data4
<hello>data5</hello>
</root>

the command:

xmllint --xpath '//*[not(ancestor-or-self::tag1)]/text()' file.xml | grep -v '^[[:space:]]*$'

will output:

data1
data2
data3
data4
data5