How can JavaScript skip HTML tags when selecting words with regular expressions?

Note:

I'm not parsing HTML with regex;
I only want to apply it to the plain text.
The problem is that the match goes beyond the plain text and affects the HTML tags as well.

Why does everyone say I should use the DOM instead of regular expressions? The DOM obviously cannot select all the words on a web page based on an array of words.

Before this, I used document.createTreeWalker() to filter all the text nodes, but it was too complicated and caused more errors. So I want to do it with a simple regex instead. Is there a better way?

I think simply filtering out everything inside "<>" with very simple regex syntax would work, wouldn't it? Why make it so complicated?

I need to select words from the page based on an array of words and wrap each word in a 'span' tag (keeping the original HTML tags).

The problem with my code is that it also replaces attribute values inside the HTML tags.

I need a regular expression that skips HTML tags and selects only the words.

I added a condition to the regular expression, (^<.*>), but it didn't work and broke my code.

How can I do this?

My code:

Error in the code: the id value in <div id="text"> should not be wrapped in a span tag.

<!DOCTYPE html>
<html>
<head>
<style>span{background:#ccc;}</style>
<script>
//wrap span tags for all words
function add_span(word_array, element_) {
    for (let i = 0; i < word_array.length; i++) {
        var reg_str = "([\\s.?,\"\';:!()\\[\\]{}<>\/])";  //    "^(<.*>)"
        var reg = new RegExp(reg_str + "(" + word_array[i] + ")" + reg_str, 'g');
        element_ = element_.replace(reg, '$1<span>$2</span>$3');
    }
    return element_;
}

window.onload = function(){
  console.log(document.body.innerText);
  // word array
  var word_array = ['is', 'test', 'testis', 'istest', 'text']

  var text_html = add_span(word_array, document.body.innerHTML);
  document.body.innerHTML = text_html;
  console.log(text_html);
}
</script>
</head>
<body>
<div id="text"><!--Error: the id attribute value here should not be wrapped in a span tag-->
is test testis istest,
is[test]testis{istest}testis(istest)testis istest
</div>
</body></html>

CodePudding user response:

I had fun with this one and learned a few things too. You could replace the traversal implementation with TreeWalker if you'd like. I added a nested div#text2 to demonstrate how it works with arbitrary tree depth. I tried to keep the same general approach you were using, but needed to make some modifications to the regex and add tree traversal. Hope this helps!

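// Breadth-first walk over the tree; wraps matching words found in text nodes.
function traverse(tree) {
  const queue = [tree];

  while (queue.length) {
    const node = queue.shift();

    if (node.nodeType === Node.TEXT_NODE) {
      // Note: trim() also drops the text node's leading and trailing whitespace.
      const textContent = node.textContent.trim();
      if (textContent) {
        // \b keeps "is" from matching inside "testis" or "istest".
        const textContentWithSpans = textContent
          .replaceAll(/\b(is|test|testis|istest|text)\b/g, '<span>$&</span>');

        // Parse the markup in a <template> so it becomes real nodes.
        // The text is not HTML-escaped here, so a literal "<" or "&" in it is parsed as markup.
        const template = document.createElement('template');
        template.innerHTML = textContentWithSpans;

        const fragment = template.content;

        // Swap the original text node for the text-plus-span nodes.
        node.parentNode.replaceChild(fragment, node);
      }
    }

    // Queue the children (a text node has none).
    for (let child of node.childNodes) {
      queue.push(child);
    }
  }
}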

traverse(document.getElementById('demo-wrapper'));
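
The pattern inside traverse() hard-codes the five words, while the question starts from an array. Here is a minimal sketch of building an equivalent pattern from that array. The helper names escapeRegExp and buildWordPattern are hypothetical, not part of the original answer, and the sketch assumes every word begins and ends with a letter, digit, or underscore so that \b behaves as expected.

```javascript
// Escape characters that have a special meaning inside a regular expression,
// so that each word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: join the words into one alternation, \b(?:w1|w2|...)\b.
function buildWordPattern(words) {
  const alternatives = words.map(escapeRegExp).join('|');
  return new RegExp(`\\b(?:${alternatives})\\b`, 'g');
}

const pattern = buildWordPattern(['is', 'test', 'testis', 'istest', 'text']);
const wrapped = 'is[test]testis{istest}'.replace(pattern, '<span>$&</span>');
// wrapped is '<span>is</span>[<span>test</span>]<span>testis</span>{<span>istest</span>}'
```

The resulting pattern could be passed to replaceAll() in traverse() in place of the regex literal. Note that \b treats only ASCII letters, digits, and underscore as word characters, so words in other scripts would likely need a different boundary test.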

<div id="demo-wrapper">
  <div id="text">
  is test testis istest,
  is[test]testis{istest}testis(istest)testis istest
    <div id="text2">
    foo bar test istest
    </div>
  </div>
</div>
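
For comparison, the string-only idea from the question (skip everything inside "<>") can be sketched with a single alternation: the regex first tries to consume a whole comment or tag and returns it unchanged, and only otherwise tries a listed word. This is a different technique from the tree traversal above, not the answer's method. The name wrapWordsOutsideTags is hypothetical, and the sketch assumes well-formed markup in which no attribute value contains ">". It would still alter the text inside script and style elements, which is one reason the DOM approach tends to be more robust.

```javascript
// Hypothetical helper: wrap listed words in <span>, leaving tags untouched.
// Assumes well-formed tags and no '>' inside attribute values.
function wrapWordsOutsideTags(html, words) {
  // Escape regex metacharacters so each word is matched literally.
  const alternatives = words
    .map((word) => word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('|');
  // The first two alternatives consume comments and whole tags;
  // the third captures a listed word that stands on its own.
  const tagOrWord = new RegExp(
    `<!--[\\s\\S]*?-->|<[^>]*>|\\b(${alternatives})\\b`, 'g');
  // The capture group is undefined when a comment or tag was matched.
  return html.replace(tagOrWord, (match, word) =>
    word === undefined ? match : `<span>${word}</span>`);
}

const sampleHtml = '<div id="text">is test, <b class="test">text</b></div>';
const result = wrapWordsOutsideTags(sampleHtml, ['is', 'test', 'text']);
// result is
// '<div id="text"><span>is</span> <span>test</span>, <b class="test"><span>text</span></b></div>'
```

Because a tag is matched as one unit before any word is tried, the "text" in id="text" and the "test" in class="test" are never seen as separate words.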