Get a string or all text, ignoring HTML tags and text inside HTML tags

I need a regular expression that fetches only a certain string pattern from the text, ignoring HTML tags such as anchors. Here is an example of what I need:

Case: Search for all words made of 4 letters followed by 2 digits, ignoring anchors
Expression: Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.

In the example above, only the first AABB16 (the one outside the anchor) and ABCD22 should be returned.

CodePudding user response：

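If you want to use a regex, you can first strip the HTML tags from your input, then match what you are looking for. Note that this removes only the tags themselves, so the link text between them is still searched: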

const html = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log('html: ' + html);
// Remove opening and closing tags; the text between them is kept.
let text = html.replace(/<\/?[a-z][^>]*>/gi, '');
console.log('text: ' + text);
// Use [A-Za-z] rather than [A-z], which would also match [ \ ] ^ _ and `.
let matches = text.match(/[A-Za-z]{4}[0-9]{2}/g);
console.log('matches: ' + JSON.stringify(matches));

Output:

html: Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.
text: Test expression containing AABB16 and also https://www.teste.com/AABB16 and ABCD22.
matches: ["AABB16","AABB16","ABCD22"]
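The output above still lists AABB16 twice, because stripping the tags keeps the link text. If the text inside the anchors should be ignored as well, as the question asks, one possible variation is to remove each whole <a>...</a> element before matching. This is only a sketch: findCodesOutsideAnchors is a hypothetical helper name, it assumes anchors are not nested, and it shares the regex limitations described further down.

```javascript
// Hypothetical helper (not part of the original answer).
// Assumes anchors are not nested and attributes contain no ">".
function findCodesOutsideAnchors(html) {
  // Drop each <a ...>...</a> element together with its text.
  const withoutAnchors = html.replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, ' ');
  // Strip any remaining tags, keeping their text.
  const text = withoutAnchors.replace(/<\/?[a-z][^>]*>/gi, ' ');
  // 4 letters followed by 2 digits, as a whole word.
  return text.match(/\b[A-Za-z]{4}[0-9]{2}\b/g) || [];
}

const sample = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log(JSON.stringify(findCodesOutsideAnchors(sample)));
// prints ["AABB16","ABCD22"]
```

The \b word boundaries are an extra assumption here: they keep the pattern from matching inside longer words such as XAABB167.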

Explanation of regex /<\/?[a-z][^>]*>/gi:

< -- expect <
\/? -- expect optional /
[a-z] -- expect a letter
[^>]* -- scan over anything that is not >
> -- expect >
/gi -- flags: g matches every occurrence, i ignores case

Using a regex to strip HTML is not foolproof. For example, the regex above fails if an HTML attribute value contains >. It is safer to use a proper HTML parser.
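To illustrate that caveat, here is a small sketch of the failure. The stripTags name and the input string are made up for the demonstration; the regex is the one from the answer.

```javascript
// Hypothetical wrapper around the tag-stripping regex from the answer.
const stripTags = (html) => html.replace(/<\/?[a-z][^>]*>/gi, '');

// The title attribute contains ">", so the regex ends the tag too early.
const tricky = '<a title="1 > 0" href="#">AABB16</a>';
console.log(JSON.stringify(stripTags(tricky)));
// prints " 0\" href=\"#\">AABB16"
```

Part of the opening tag is left behind in the result, which is why a parser is the more robust choice for arbitrary markup.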

CodePudding user response：

Use .innerHTML.

alert(document.getElementById("list").innerHTML);

<ul id="list">
  <li><a href="#">Item 1</a></li>
  <li><a href="#">Item 2</a></li>
  <li><a href="#">Item 3</a></li>
</ul>