I need a regular expression that matches only a certain string pattern in a text while ignoring HTML tags such as anchors. Here is an example of what I need:
Case: Find all words made of 4 letters followed by 2 digits, ignoring anchors
Expression:
Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.
In the example above, only the first AABB16 (the one outside the anchor) and ABCD22 should be returned.
CodePudding user response:
If you want to use a regex, you can first strip the HTML tags from your input and then match what you are looking for:
const html = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log('html: ' + html);
// Strip opening and closing tags; the text between them is kept.
const text = html.replace(/<\/?[a-z][^>]*>/gi, '');
console.log('text: ' + text);
// Use [A-Za-z] rather than [A-z]: the latter also matches [ \ ] ^ _ and `.
const matches = text.match(/[A-Za-z]{4}[0-9]{2}/g);
console.log('matches: ' + JSON.stringify(matches));
Output:
html: Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.
text: Test expression containing AABB16 and also https://www.teste.com/AABB16 and ABCD22.
matches: ["AABB16","AABB16","ABCD22"]
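Note that this output still contains the AABB16 taken from the link text, because only the tags are stripped, not what sits between them. If the anchor's content should be ignored as well, one possible variation is to drop each whole anchor element before stripping the remaining tags. The following is only a sketch under that assumption: the function name findCodesOutsideAnchors is made up for illustration, and it has the same limits as any regex-based HTML handling.

```javascript
// Hypothetical helper: returns codes (4 letters + 2 digits) found outside <a> elements.
function findCodesOutsideAnchors(html) {
  // Drop each anchor element together with its content (non-greedy, may span newlines).
  const withoutAnchors = html.replace(/<a\b[^>]*>[\s\S]*?<\/a\s*>/gi, ' ');
  // Strip any remaining tags, keeping the text between them.
  const text = withoutAnchors.replace(/<\/?[a-z][^>]*>/gi, ' ');
  // \b keeps the pattern from matching inside longer words such as XAABB163.
  return text.match(/\b[A-Za-z]{4}[0-9]{2}\b/g) || [];
}

const html = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log(JSON.stringify(findCodesOutsideAnchors(html))); // ["AABB16","ABCD22"]
```

The anchor is replaced with a space rather than an empty string so that the words on either side of it cannot fuse into a new, accidental match.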
Explanation of the regex /<\/?[a-z][^>]*>/gi:
< -- expect <
\/? -- expect an optional /
[a-z] -- expect a letter
[^>]* -- scan over anything that is not >
> -- expect >
/gi -- global flag to match multiple occurrences, and flag to ignore case
Using a regex to strip HTML is not foolproof. For example, the regex above fails if the text of an HTML attribute contains >. So it is safer to use a proper HTML parser.
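To make that failure mode concrete, here is a small sketch. The input string is a made-up example, and stripTags is just a name given here to the replace call shown earlier.

```javascript
// The tag-stripping regex from the answer, wrapped in a function.
const stripTags = (html) => html.replace(/<\/?[a-z][^>]*>/gi, '');

// Hypothetical input: the title attribute value contains a literal ">".
const tricky = '<a title="x > y" href="#">link</a> ABCD22';

// The match for the opening tag ends at the first ">",
// so the rest of the tag leaks into the resulting text.
console.log(JSON.stringify(stripTags(tricky))); // " y\" href=\"#\">link ABCD22"
```

The closing </a> is still removed, but part of the opening tag survives as text, which is the kind of case a real HTML parser handles correctly.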
CodePudding user response:
Use .innerHTML.
alert(document.getElementById("list").innerHTML);
<ul id="list">
<li><a href="#">Item 1</a></li>
<li><a href="#">Item 2</a></li>
<li><a href="#">Item 3</a></li>
</ul>