I have a regular expression that runs on a string of HTML, but I need to exclude anything inside a <p></p> element
from matching. Is there a way to do this within my current regex?
My regex (matches $, %, decimal, and whole-number values in a string): /(?:\$?)(?:\d{1,3}(?:,\d{3})*(?:\%?)|\d+)(?:\.\d+(?:\%?))?/g
Basically, given the following input, the regex should behave like this:
<div>$50</div>
<p>$40</p>
<div>$30</div>
matches: $50 & $30
ignores: $40
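For reference, here is a small sketch of what the regex does today on that sample (the variable names are just for illustration). It picks up all three values, including the one inside the <p>:

```javascript
// The sample markup from above, as a single string.
const sampleHtml = "<div>$50</div>\n<p>$40</p>\n<div>$30</div>";

// The regex from the question, unchanged.
const numberRegex = /(?:\$?)(?:\d{1,3}(?:,\d{3})*(?:\%?)|\d+)(?:\.\d+(?:\%?))?/g;

// With the g flag, match() returns every match as a plain array.
const found = sampleHtml.match(numberRegex);
console.log(found); // [ '$50', '$40', '$30' ]  ($40 should not be here)
```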
CodePudding user response:
/(?<!<p>[^>]*)(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)/g
This works in most browsers; see https://regex101.com/r/vqZ6GO/2
I say "most" because negative lookbehind is still not officially supported in every browser, but it is supported in all currently supported versions of Chrome, Edge, and Firefox.
/(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)(?![^>]*<\/p>)/g
This version uses only a negative lookahead, so it works in all browsers.
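A quick sketch comparing the two patterns on the question's sample. The variable names and the nested example are assumptions added for illustration, not part of the original answer:

```javascript
const sample = "<div>$50</div><p>$40</p><div>$30</div>";

// Variant 1: negative lookbehind (needs an engine with lookbehind support).
const withLookbehind = /(?<!<p>[^>]*)(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)/g;
// Variant 2: negative lookahead only.
const withLookahead = /(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)(?![^>]*<\/p>)/g;

const viaLookbehind = sample.match(withLookbehind);
const viaLookahead = sample.match(withLookahead);
console.log(viaLookbehind); // [ '$50', '$30' ]
console.log(viaLookahead);  // [ '$50', '$30' ]

// Both patterns stop scanning at the nearest ">", so a value wrapped in
// another tag inside the <p> is no longer recognised as being inside it.
const nested = "<p><b>$40</b></p>";
const nestedLookbehind = nested.match(withLookbehind);
const nestedLookahead = nested.match(withLookahead);
console.log(nestedLookbehind); // [ '$40' ]
console.log(nestedLookahead);  // [ '$40' ]
```

So both patterns seem fine for flat markup like the sample, but it is worth checking them against nested tags before relying on them.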
CodePudding user response:
You could use DOMParser to convert your string to HTML, then use querySelectorAll and forEach to remove the p tags from your document, and then run your regex:
const htmlString = "<div>$50</div><p>$40</p><div>$30</div>";
// Parse the string into a detached HTML document
const doc = new DOMParser().parseFromString(htmlString, "text/html");
// Remove every <p> element, together with its contents
doc.querySelectorAll('p').forEach((p) => p.remove());
console.log(doc.body.innerHTML);
// Run the regex against what is left in doc.body.innerHTML
const matches = doc.body.innerHTML.match(/(?:\$?)(?:\d{1,3}(?:,\d{3})*(?:\%?)|\d+)(?:\.\d+(?:\%?))?/g);
console.log(matches);
CodePudding user response:
You can use a lookbehind and a lookahead, with alternation and an exclusion class ([^p]). There are two regexes:
- Matches $,.% and numbers only within HTML tags, except for <p> tags.
- Matches $,.% and numbers anywhere except within <p> tags. ✼
Figure I
/(?<=<|[^p]>)[$.,%\d]+(?=<[^p])/g
Figure II ✼
/(?<=<|[^p]>|\/p>)[$.,%\d]+/g
Figure III
Segment | Description | Alternate |
---|---|---|
`(?<=` | Start lookbehind - everything within the parentheses of a lookbehind must match in order for whatever is after it to match. The lookbehind itself is not consumed (it isn't included in the results). | |
`<\|` | Match a literal < OR... | |
`[^p]>` | ...anything but a literal p, followed by a literal >... | |
`\|\/p>` | ...OR a literal /p> ✼ | #2 |
`)[$.,%\d]+` | End lookbehind - match one or more of $ . , % or a digit. This is a naïve expression that assumes the matches are in a logical and grammatically correct format. | |
`(?=<[^p])` | Lookahead - everything within the parentheses of a lookahead must match in order for whatever is before it to match. The lookahead itself is not consumed (it isn't included in the results). Match a literal < followed by anything but a literal p. | #1 |
✼Added alternate criteria per yuhsef's comment.
let str = `<div>$50.25</div>5.00<p>$100.00</p>150%<div>1,000%</div>2,000`;
const rgx1 = /(?<=<|[^p]>)[$.,%\d]+(?=<[^p])/g;
const rgx2 = /(?<=<|[^p]>|\/p>)[$.,%\d]+/g;
const matchesA = str.match(rgx1);
const matchesB = str.match(rgx2);
console.log(matchesA);
console.log(matchesB);
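To illustrate the "naïve" remark in the table: the character class [$.,%\d]+ accepts any run of those characters, not only well-formed numbers. The sketch below is an assumption layered on top of this answer, not part of it. It keeps the lookbehind from Figure II but swaps the character class for a stricter number pattern loosely based on the one in the question; the names naive and strict are hypothetical:

```javascript
// Figure II as given: any run of $ . , % or digits outside <p>.
const naive = /(?<=<|[^p]>|\/p>)[$.,%\d]+/g;

// Assumed stricter variant: same lookbehind, but the body must look like a number
// (optional $, digits with optional thousands groups, optional decimals, optional %).
const strict = /(?<=<|[^p]>|\/p>)\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?/g;

const text = "<div>...</div><div>$1,250.50</div><p>$40</p>";

const naiveMatches = text.match(naive);
const strictMatches = text.match(strict);
console.log(naiveMatches);  // [ '...', '$1,250.50' ]
console.log(strictMatches); // [ '$1,250.50' ]
```

Both variants still depend on the [^p]> test, which looks only at the last letter of the preceding tag name, so a tag such as <sup> would be treated like <p>.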