Ignore all text between HTML tag JavaScript Regex

Time:07-29

I have a regular expression that runs on a string of HTML, but I need to keep anything inside a <p></p> element from matching. Is there a way to do this in my current regex?

My regex (matches $, %, decimal, and whole-number values in a string): /(?:\$?)(?:\d{1,3}(?:,\d{3})*(?:\%?)|\d+)(?:\.\d+(?:\%?))?/g

Basically, this regex should behave as follows for this input:

<div>$50</div>
<p>$40</p>
<div>$30</div>

matches: $50 & $30
ignores: $40

CodePudding user response:

/(?<!<p>[^>]*)(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)/g will work in most browsers; see https://regex101.com/r/vqZ6GO/2

I wrote "most" because negative lookbehind is still not officially supported in all browsers, but it is supported in all currently supported versions of Chrome, Edge, and Firefox.

Full list here.

/(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\%?)(?![^>]*<\/p>)/g will work in all browsers, since it relies only on a negative lookahead.
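To make the browser-support trade-off above concrete, here is a minimal sketch that picks between the two patterns at runtime. The helper names supportsLookbehind and matchOutsideP are hypothetical (they are not part of the answer); the sketch assumes that an engine without lookbehind support throws a SyntaxError when the pattern is compiled, which is why the lookbehind version is built from a string rather than written as a literal.

```javascript
// Hypothetical helper: returns true when the engine can compile a lookbehind.
// Unsupported syntax makes the RegExp constructor throw a SyntaxError.
function supportsLookbehind() {
  try {
    new RegExp("(?<!a)b");
    return true;
  } catch (err) {
    return false;
  }
}

// Lookbehind variant, built from a string so that older engines do not
// fail while parsing the script itself.
const lookbehindSource =
  "(?<!<p>[^>]*)(?:\\$?\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?%?)";

// Lookahead variant from the answer; safe to write as a literal.
const lookaheadPattern =
  /(?:\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?%?)(?![^>]*<\/p>)/g;

// Hypothetical helper: returns all matches that are not directly inside <p>.
function matchOutsideP(input) {
  const pattern = supportsLookbehind()
    ? new RegExp(lookbehindSource, "g")
    : lookaheadPattern;
  return input.match(pattern) ?? [];
}

// Sample input from the question.
const sampleHtml = "<div>$50</div>\n<p>$40</p>\n<div>$30</div>";
console.log(matchOutsideP(sampleHtml)); // [ '$50', '$30' ]
```

Both patterns only look as far as the nearest >, so markup nested inside the <p> (for example <p>$40 <b>x</b></p>) can defeat them; for arbitrary HTML a parser-based approach is likely more robust.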

CodePudding user response:

You could use DOMParser to parse your string into an HTML document, use querySelectorAll and forEach to remove the p elements from it, and then run your regex on what remains:

const htmlString = "<div>$50</div><p>$40</p><div>$30</div>";
// Parse the string into a document so the <p> elements can be removed safely.
const doc = new DOMParser().parseFromString(htmlString, "text/html");
doc.querySelectorAll('p').forEach((p) => p.remove());
console.log(doc.body.innerHTML); // <div>$50</div><div>$30</div>
// Run the regex against the remaining markup in doc.body.innerHTML.
const matches = doc.body.innerHTML.match(/(?:\$?)(?:\d{1,3}(?:,\d{3})*(?:\%?)|\d+)(?:\.\d+(?:\%?))?/g);
console.log(matches); // ["$50", "$30"]

CodePudding user response:

You can use a lookahead and a lookbehind, with an alternation around a negated character class. There are two regexes:

  1. Matches $ , . % and digits only inside HTML tags other than <p>.
  2. Matches $ , . % and digits anywhere except inside <p> tags.

Figure I

/(?<=<|[^p]>)[$.,%\d]+(?=<[^p])/g

Figure II

/(?<=<|[^p]>|\/p>)[$.,%\d]+/g

Figure III

Segment        Description  Alternate
(?<=           Start lookbehind: everything inside the parentheses of a lookbehind must match for whatever comes after it to match. The lookbehind itself is not consumed (it isn't included in the results).
<|             Match a literal < OR...
[^p]>          ...any character except a literal p, followed by a literal >.
|\/p>          ...OR a literal /p>  #2
)[$.,%\d]+     End lookbehind, then match one or more of $ . , % or a digit. This is a naïve expression that assumes the matches are in a logical and grammatically correct format.
(?=<[^p])      Lookahead: everything inside the parentheses of a lookahead must match for whatever comes before it to match. The lookahead itself is not consumed (it isn't included in the results). Match a literal < followed by any character except a literal p.  #1

#1 Regex101
#2 Regex101

Added alternate criteria per yuhsef's comment.

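const str = `<div>$50.25</div>5.00<p>$100.00</p>150%<div>1,000%</div>2,000`;
// rgx1: values inside tags other than <p> (requires a following non-<p> tag).
const rgx1 = /(?<=<|[^p]>)[$.,%\d]+(?=<[^p])/g;
// rgx2: values anywhere except directly inside <p>.
const rgx2 = /(?<=<|[^p]>|\/p>)[$.,%\d]+/g;

const matchesA = str.match(rgx1);
const matchesB = str.match(rgx2);

console.log(matchesA); // ["$50.25", "1,000%"]
console.log(matchesB); // ["$50.25", "5.00", "150%", "1,000%", "2,000"]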
