I have 250 blocks of HTML list items, and I need to remove specific entries wrapped in <h3></h3>
tags.
Each entry to remove (the h3, li, and a elements together) contains "USPS".
<ul>
<h3>
<li><a href="medicine/Alabama/Birmingham">Medicine in Birmingham, AL</a>
</li>
</h3>
<h3>
<li><a href="/shampoo/Alabama/Birmingham">Shampoo in Birmingham, AL</a>
</li>
</h3>
<h3>
<li><a href="/usps/Alabama/Birmingham">USPS in Birmingham, AL</a></li>
</h3>
<h3>
<li><a href="/snacks/Alabama/Birmingham">Snacks in Birmingham, AL</a></li>
</h3>
</ul>
<ul>
<h3>
<li><a href="/medicine/Arizona/Mesa">Medicine in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/shampoo/Arizona/Mesa">Shampoo in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/usps/Arizona/Mesa">USPS in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/snacks/Arizona/Mesa">Snacks in Mesa, AZ</a></li>
</h3>
</ul>
I have tried using regex, but it's removing too much. I have a saved link here for the latest regex attempt: https://regex101.com/r/l4Ud4v/1
(?s)<h3>.*USPS.*?<\/h3>
Desired results:
<ul>
<h3>
<li><a href="medicine/Alabama/Birmingham">Medicine in Birmingham, AL</a>
</li>
</h3>
<h3>
<li><a href="/shampoo/Alabama/Birmingham">Shampoo in Birmingham, AL</a>
</li>
</h3>
<h3>
<li><a href="/snacks/Alabama/Birmingham">Snacks in Birmingham, AL</a></li>
</h3>
</ul>
<ul>
<h3>
<li><a href="/medicine/Arizona/Mesa">Medicine in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/shampoo/Arizona/Mesa">Shampoo in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/snacks/Arizona/Mesa">Snacks in Mesa, AZ</a></li>
</h3>
</ul>
There are 250 of these "USPS" instances that need to be removed while preserving the rest of the HTML.
CodePudding user response:
Try
(?s)<h3>(?:(?!</h3>).)*USPS.*?</h3>
https://regex101.com/r/AB6wxS/1
Even the non-greedy (?s)<h3>.*?USPS.*?</h3> will fail, because it starts matching at the first <h3> and then consumes characters until it finds USPS, running over the closing tags in between. To avoid that, use (?:(?!</h3>).)* which matches any character as long as it is not the start of </h3>.
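To make the difference concrete, here is a minimal Python sketch that runs both patterns against a shortened copy of the question's markup. Python's re module is an assumption here (the thread only mentions regex101 and Sublime), and the variable names are illustrative.

```python
import re

# Shortened sample modelled on the question's markup.
html = """<ul>
<h3>
<li><a href="/shampoo/Alabama/Birmingham">Shampoo in Birmingham, AL</a>
</li>
</h3>
<h3>
<li><a href="/usps/Alabama/Birmingham">USPS in Birmingham, AL</a></li>
</h3>
<h3>
<li><a href="/snacks/Alabama/Birmingham">Snacks in Birmingham, AL</a></li>
</h3>
</ul>
"""

# Lazy version: starts at the first <h3> and runs across </h3> to reach USPS.
lazy = re.compile(r"(?s)<h3>.*?USPS.*?</h3>")

# Tempered version: the part before USPS may not cross a closing </h3>.
tempered = re.compile(r"(?s)<h3>(?:(?!</h3>).)*USPS.*?</h3>")

lazy_matches = lazy.findall(html)
tempered_matches = tempered.findall(html)

# Remove each USPS block, plus the newline that follows it, if any.
cleaned = re.sub(r"(?s)<h3>(?:(?!</h3>).)*USPS.*?</h3>\n?", "", html)
print(cleaned)
```

In this sample the lazy match swallows the Shampoo entry along with the USPS one, while the tempered match is confined to the single USPS block.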
CodePudding user response:
If you have that specific formatting for all the lines (with h3, li, a), and you want to match them in Sublime:
<h3>\s*<li>\s*<a\b[^<>]*>[^<>]*\bUSPS\b[^<>]*</a>\s*</li>\s*</h3>
The \s* matches optional whitespace characters, and [^<>]* is a negated character class that matches any character, including newlines, except for < and >.
See a regex demo.
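The same pattern can be tried outside Sublime. The following is a rough Python sketch, assuming the markup always follows the h3 > li > a nesting from the question; the trailing \s* is an addition of this sketch (not part of the pattern above) so that no blank line is left behind, and the name USPS_ITEM is illustrative.

```python
import re

# The structure-specific pattern, with a trailing \s* added to swallow
# the line break after the removed block (an addition for this sketch).
USPS_ITEM = re.compile(
    r"<h3>\s*<li>\s*<a\b[^<>]*>[^<>]*\bUSPS\b[^<>]*</a>\s*</li>\s*</h3>\s*"
)

# Sample modelled on the question; the USPS entry closes </li> on its own line
# to show that \s* tolerates both layouts used in the original markup.
html = """<ul>
<h3>
<li><a href="/medicine/Arizona/Mesa">Medicine in Mesa, AZ</a></li>
</h3>
<h3>
<li><a href="/usps/Arizona/Mesa">USPS in Mesa, AZ</a>
</li>
</h3>
<h3>
<li><a href="/snacks/Arizona/Mesa">Snacks in Mesa, AZ</a></li>
</h3>
</ul>
"""

# subn returns the new string and the number of replacements made.
cleaned, removed = USPS_ITEM.subn("", html)
print(removed)
print(cleaned)
```

Because the pattern spells out every tag, it cannot run across a neighbouring entry, but it will silently skip any block whose structure differs (for example, an extra element inside the li), so checking that the replacement count matches the expected 250 is a sensible safeguard.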