I am trying to strip all HTML tags except those on the first line, using this regex:
(?ms)(?!\A)<[^>]*>
It's very close to working, but it also strips the closing tag from the first line. The example I am working with is:
<div id="uniquename">https://www.example.com?item_id=10302</div>
<div id="uniqname2">
<div id="uniqname3">
<h2 id="uniqnametitle">Title</h2>
<div >
<div >Example:</div>
<div ><b>Sub example</b></div>
</div>
<div >
<div >Additional</div>
The current regex keeps the first line's opening tag and removes every other HTML tag, including the closing </div> at the end of the first line. It outputs the following:
<div id="uniquename">https://www.example.com?item_id=10302
Title
Example:
Sub example
Additional
If there is a better way to do this than excluding the first line, I am open to suggestions. Skipping the first line seems easiest, but I need its closing tag to stay intact.
What am I missing in my regex?
CodePudding user response:
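(?!\A) only blocks a match that starts at the very beginning of the input, so it protects the first opening tag but not the </div> later on the same line. You can instead match the whole first line as a named group and put it back: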
(?ms)((?<firstline>\A[^\n]*)|(<[^>]*>))
With substitution
$firstline
Playground for your example - https://regex101.com/r/ASItOP/3
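For anyone who wants to try this outside regex101, here is a minimal Python sketch of the same alternation idea. It is an assumption on my part that Python is acceptable here, since the question does not name a host language. Python's re module spells the named group (?P<firstline>...) rather than (?<firstline>...), and the helper name strip_tags_regex is made up for illustration. The sketch also drops the blank lines that are left behind once the tags are removed, which the bare substitution does not do.

```python
import re

SAMPLE = """<div id="uniquename">https://www.example.com?item_id=10302</div>
<div id="uniqname2">
<div id="uniqname3">
<h2 id="uniqnametitle">Title</h2>
<div >
<div >Example:</div>
<div ><b>Sub example</b></div>
</div>
<div >
<div >Additional</div>"""

# The pattern from the question: the lookahead only rejects a match that
# starts at position 0, so the </div> later on line 1 is still removed.
ORIGINAL = re.compile(r"(?ms)(?!\A)<[^>]*>")

# Alternation: either the whole first line (captured) or any tag.
FIXED = re.compile(r"(?P<firstline>\A[^\n]*)|<[^>]*>")


def strip_tags_regex(text):
    """Remove tags everywhere except on the first line (hypothetical helper)."""
    # Put the first line back unchanged; replace every other match with "".
    stripped = FIXED.sub(lambda m: m.group("firstline") or "", text)
    lines = stripped.split("\n")
    # Keep line 1 as-is and drop lines that became empty after stripping.
    return "\n".join([lines[0]] + [ln for ln in lines[1:] if ln.strip()])


if __name__ == "__main__":
    print(ORIGINAL.sub("", SAMPLE).split("\n")[0])  # first line loses </div>
    print(strip_tags_regex(SAMPLE))
```

The first print shows the behaviour described in the question; the second keeps the first line whole.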
CodePudding user response:
You should use an HTML parser in general...
However, you can do:
$ cat <(head -n 1 file) <(sed -nE '2,$ p' file | sed -E 's/<[^>]*>//g; /^$/d')
Or an awk:
$ awk 'FNR==1 {print; next}
{gsub(/<[^>]*>/,""); if ($0 != "") print}' file
Either prints:
<div id="uniquename">https://www.example.com?item_id=10302</div>
Title
Example:
Sub example
Additional
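As for the "use an HTML parser" advice, a rough sketch with Python's standard-library html.parser might look like the following. Treat it as an illustration under assumptions, not a drop-in solution: it assumes Python is available, that the line to preserve is always exactly the first line, and the names TextCollector and strip_tags_parser are invented for this example.

```python
from html.parser import HTMLParser

SAMPLE = """<div id="uniquename">https://www.example.com?item_id=10302</div>
<div id="uniqname2">
<div id="uniqname3">
<h2 id="uniqnametitle">Title</h2>
<div >
<div >Example:</div>
<div ><b>Sub example</b></div>
</div>
<div >
<div >Additional</div>"""


class TextCollector(HTMLParser):
    """Collects only the text nodes and ignores every tag (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_parser(text):
    # Split off the first line so it is never fed to the parser.
    first, _, rest = text.partition("\n")
    parser = TextCollector()
    parser.feed(rest)
    parser.close()  # flush any text still buffered by the parser
    body = "".join(parser.chunks)
    # Drop the lines that contained nothing but tags.
    kept = [ln for ln in body.split("\n") if ln.strip()]
    return "\n".join([first] + kept)


if __name__ == "__main__":
    print(strip_tags_parser(SAMPLE))
```

Unlike the <[^>]*> pattern, a parser is not confused by a > inside an attribute value, which is the usual reason for preferring one.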
CodePudding user response:
UPDATE 1: just realized it could be massively simplified
gawk 'NR==!_ || (NF=NF)*/./' FS='<[^>]+>' OFS=
mawk 'NR==!_ || (NF=NF)*/./' FS='^(<[^>]+>)+|(<[/][^>]+>)+$' OFS=
<div id="uniquename">https://www.example.com?item_id=10302</div>
Title
Example:
Sub example
Additional