Regex: Exclude first line with brackets

I am trying to strip all HTML tags except those on the first line, using this regex:

(?ms)(?!\A)<[^>]*>

It's very close to working, but it also strips the closing tag from the first line. The example I am working with is:

<div id="uniquename">https://www.example.com?item_id=10302</div>
<div id="uniqname2">
<div id="uniqname3">
<h2 id="uniqnametitle">Title</h2>
<div >
<div >Example:</div>
<div ><b>Sub example</b></div>
</div>
<div >
<div >Additional</div>

The current regex keeps the first line's opening tag and removes every other HTML tag, including the trailing </div> on the first line. It outputs the following:

<div id="uniquename">https://www.example.com?item_id=10302
Title
Example:
Sub example
Additional

If there is a better way to do this than excluding the first line, I am open to suggestions. Skipping the first line seems to be the easiest way; however, I need its closing tag to stay intact.

What am I missing in my regex?

CodePudding user response：

You can try this:
(?ms)((?<firstline>\A[^\n]*)|(<[^>]*>))
With the substitution:
$firstline

Playground for your example - https://regex101.com/r/ASItOP/3
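
For anyone who wants to try this outside regex101, here is a rough Python sketch of the same idea. It also shows why the original pattern falls short: (?!\A) only rejects a match that starts at offset 0 of the input, so the opening tag on line 1 survives but the </div> later on that line does not. Python spells the named group (?P<firstline>...). The names SAMPLE, ORIGINAL, FIXED and strip_tags_keep_first_line are made up for illustration, and dropping the leftover empty lines is an extra step I am assuming you want, since it matches the output shown in the question.

```python
import re

SAMPLE = """\
<div id="uniquename">https://www.example.com?item_id=10302</div>
<div id="uniqname2">
<div id="uniqname3">
<h2 id="uniqnametitle">Title</h2>
<div >
<div >Example:</div>
<div ><b>Sub example</b></div>
</div>
<div >
<div >Additional</div>
"""

# Original attempt: the lookahead only blocks a match at offset 0,
# so every later tag on the first line is still removed.
ORIGINAL = re.compile(r"(?!\A)<[^>]*>")

# Alternation: first try to consume the whole first line into a named
# group; anywhere else, match a single tag.
FIXED = re.compile(r"(?P<firstline>\A[^\n]*)|<[^>]*>")


def strip_tags_keep_first_line(text: str) -> str:
    # When the first-line branch matched, put that text back unchanged.
    # When a tag matched, the group is None and nothing is put back.
    stripped = FIXED.sub(lambda m: m.group("firstline") or "", text)
    # Lines that contained only tags are now empty; drop them (assumed).
    return "\n".join(line for line in stripped.splitlines() if line.strip())


if __name__ == "__main__":
    # First line as mangled by the original pattern:
    print(ORIGINAL.sub("", SAMPLE).splitlines()[0])
    # Full result with the alternation:
    print(strip_tags_keep_first_line(SAMPLE))
```

The key point is that the regex engine tries the first-line branch before the tag branch at offset 0, so the whole first line is swallowed in one match and its tags are never seen individually.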

CodePudding user response：

You should use an HTML parser in general...

However, you can do:

$ cat <(head -n 1 file) <(sed -nE '2,$ p' file | sed -E 's/<[^>]*>//g; /^$/d')

Or an awk:

$ awk 'FNR==1 {print; next}
      {gsub(/<[^>]*>/,""); if ($0 != "") print}' file

Either prints:

<div id="uniquename">https://www.example.com?item_id=10302</div>
Title
Example:
Sub example
Additional
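
As for the HTML parser advice at the top of this answer, a minimal sketch with Python's standard-library html.parser might look like the following. It keeps the first line verbatim and extracts only the text nodes from each later line. The class and function names are hypothetical, and processing the input line by line is an assumption carried over from the sed/awk versions, so a tag that spans two lines would not be handled.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def line_text(line):
    # Text content of one line of markup, with surrounding whitespace removed.
    collector = TextCollector()
    collector.feed(line)
    collector.close()  # flush any text the parser is still buffering
    return "".join(collector.chunks).strip()


def keep_first_strip_rest(text):
    lines = text.splitlines()
    if not lines:
        return ""
    first, rest = lines[0], lines[1:]
    # Keep only the lines that still have text once the tags are gone.
    body = [t for t in (line_text(line) for line in rest) if t]
    return "\n".join([first] + body)


if __name__ == "__main__":
    demo = '<div id="a">keep me</div>\n<div >\n<div ><b>Sub example</b></div>\n'
    print(keep_first_strip_rest(demo))
```

One difference from the regex versions: by default the parser also decodes character references, so &amp; in the input comes out as & in the text.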

CodePudding user response：

UPDATE 1: just realized it could be massively simplified:

gawk 'NR==!_ || (NF=NF)*/./' FS='<[^>]+>' OFS=

mawk 'NR==!_ || (NF=NF)*/./' FS='^(<[^>]+>)+|(<[/][^>]+>)+$' OFS=

<div id="uniquename">https://www.example.com?item_id=10302</div>
Title
Example:
Sub example
Additional
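
Since the one-liner is fairly dense, here is a rough Python rendering of what the gawk version does step by step. It assumes the field separator is meant to be the tag pattern <[^>]+>; the names TAG and awk_like are made up for illustration.

```python
import re

# Assumed field separator (FS): one HTML-like tag.
TAG = re.compile(r"<[^>]+>")


def awk_like(lines):
    out = []
    for nr, line in enumerate(lines, start=1):
        if nr == 1:
            # NR==!_ : an unset variable negated is 1, so this is NR==1
            # and the first record is printed unchanged.
            out.append(line)
            continue
        fields = TAG.split(line)   # split the record on FS
        rebuilt = "".join(fields)  # NF=NF rebuilds $0 using the empty OFS
        if rebuilt:                # /./ : print only non-empty records
            out.append(rebuilt)
    return out


if __name__ == "__main__":
    for row in awk_like(['<div id="a">first</div>', "<div >", "<div >Example:</div>"]):
        print(row)
```

In other words, the tags are treated as field separators, and rejoining the fields with an empty output separator is what removes them.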