Get a string or all text ignoring html tags and texts inside html tags

Time:10-18

I need a regular expression that fetches only strings matching a certain pattern in the text, ignoring HTML tags such as anchors. Here is an example of what I need:

  • Case: Find all words made of 4 letters followed by 2 digits, ignoring anchors

  • Expression: Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16"</a> and ABCD22.

In the example above, only the first AABB16 (the one outside the anchor) and ABCD22 should be returned.

CodePudding user response:

If you want to use a regex, you can first strip the HTML tags from your input, then match what you are looking for:

const html = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log('html: ' + html);
// Strip opening and closing tags, keeping the text between them.
let text = html.replace(/<\/?[a-z][^>]*>/gi, '');
console.log('text: ' + text);
// [A-Za-z] rather than [A-z], which would also match [ \ ] ^ _ and `.
let matches = text.match(/[A-Za-z]{4}[0-9]{2}/g);
console.log('matches: ' + JSON.stringify(matches));
Output:

html: Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.
text: Test expression containing AABB16 and also https://www.teste.com/AABB16 and ABCD22.
matches: ["AABB16","AABB16","ABCD22"]
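Note that the output above still contains AABB16 twice, because stripping only the tags leaves the link text in place. A minimal sketch of one way to get the result the question asks for is to remove each anchor element together with its content before matching. It assumes anchors are never nested and that no attribute value contains >. The function name matchOutsideAnchors is made up for this illustration:

```javascript
// Hypothetical helper: finds codes of 4 letters + 2 digits outside <a>...</a>.
// Assumes anchors are not nested and attribute values contain no '>'.
function matchOutsideAnchors(html) {
  // Remove every anchor element with its content (lazy match up to the nearest </a>).
  const withoutAnchors = html.replace(/<a\b[^>]*>[\s\S]*?<\/a\s*>/gi, ' ');
  // Strip any remaining tags, keeping their text content.
  const text = withoutAnchors.replace(/<\/?[a-z][^>]*>/gi, '');
  // \b keeps the pattern from matching inside longer words.
  return text.match(/\b[A-Za-z]{4}[0-9]{2}\b/g) || [];
}

const input = 'Test expression containing AABB16 and also <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.';
console.log(matchOutsideAnchors(input)); // [ 'AABB16', 'ABCD22' ]
```

The word boundaries are an extra choice made here, not part of the original answer; drop them if codes embedded in longer strings should also count.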

Explanation of regex /<\/?[a-z][^>]*>/gi:

  • < -- expect <
  • \/? -- expect optional /
  • [a-z] -- expect a letter
  • [^>]* -- scan over anything that is not >
  • > -- expect >
  • /gi -- flags: g matches every occurrence, i ignores case
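
To see what each part of the pattern does in practice, the short check below lists the substrings it matches. The sample string and the variable names are made up for illustration. A bare < that is not followed by a letter, as in 1 < 2, is left alone:

```javascript
// Same tag pattern as in the answer.
const tagPattern = /<\/?[a-z][^>]*>/gi;

// Made-up sample: one bare '<', an uppercase tag pair, and a self-closing tag.
const sample = 'if 1 < 2 then <B class="x">bold</B> and <br/>';

const tags = sample.match(tagPattern);
console.log(tags); // [ '<B class="x">', '</B>', '<br/>' ]

// Removing the matches keeps the surrounding text, including the bare '<'.
const stripped = sample.replace(tagPattern, '');
console.log(stripped); // if 1 < 2 then bold and
```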

Using a regex to strip HTML is not foolproof. For example, the regex above fails if an HTML attribute value contains >. It is therefore safer to use a proper HTML parser.
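
The failure case can be reproduced with a made-up input, and a parser-based alternative might look like the sketch below. It assumes a browser environment where DOMParser is available (it is not defined in plain Node.js), and the names stripTags and matchOutsideAnchorsWithParser are hypothetical:

```javascript
// Regex stripping breaks when an attribute value contains '>'.
const stripTags = (html) => html.replace(/<\/?[a-z][^>]*>/gi, '');
const tricky = '<a title="a > b" href="#">link</a>';
// The tag is cut at the first '>', so attribute debris leaks into the text.
console.log(JSON.stringify(stripTags(tricky))); // " b\" href=\"#\">link"

// Browser-only alternative: let the HTML parser find the anchors.
function matchOutsideAnchorsWithParser(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // Drop every anchor element, including its text content.
  doc.querySelectorAll('a').forEach((a) => a.remove());
  return doc.body.textContent.match(/[A-Za-z]{4}[0-9]{2}/g) || [];
}

// Only run the parser version where a DOM exists.
if (typeof DOMParser !== 'undefined') {
  console.log(matchOutsideAnchorsWithParser(
    'Test AABB16 and <a href="https://www.teste.com/AABB16">https://www.teste.com/AABB16</a> and ABCD22.'
  ));
}
```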

CodePudding user response:

Use .innerHTML.

alert (document.getElementById("list").innerHTML);
<ul id="list">
  <li><a href="#">Item 1</a></li>
  <li><a href="#">Item 2</a></li>
  <li><a href="#">Item 3</a></li>
</ul>