How to remove punctuation without changing HTML tags in a JavaScript string?

I'm creating a Chrome extension that strips the punctuation from a page, but my code also changes the page's HTML tags. IDs, inline styles, and even SVG paths are affected by the punctuation removal.

function removePunctuation(text) {
    text = text.replace(/[\.#!£$%\^&\*;:{}=\-_`~()@\+\?\[\]]/g, ' ').replace(/ {2,}/g, ' ')
    return text
}

let textTags = document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, caption, span, a, div');
for (let i = 0, l = textTags.length; i < l; i++) {
    textTags[i].innerHTML = removePunctuation(textTags[i].innerHTML)
}

To get all of the text from a page, I'm using document.querySelectorAll.
To remove all the punctuation, except angle brackets, from each element's innerHTML, I loop through the elements and apply the regex above with String.prototype.replace().
I then set the result back onto the page.
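
The effect on attributes can be reproduced without a page by running the same function on a markup string, which is roughly what innerHTML returns. This is only a sketch: the sample markup is made up, and the regex assumes the garbled `\ ` in the snippet above was originally `\+` and that the second replace was meant to collapse runs of spaces.

```javascript
// Same idea as the function above; the character class is an assumed
// reconstruction of the original regex.
function removePunctuation(text) {
  return text
    .replace(/[\.#!£$%\^&\*;:{}=\-_`~()@\+\?\[\]]/g, ' ') // punctuation -> space
    .replace(/ {2,}/g, ' ');                               // collapse repeated spaces
}

// Hypothetical markup, as innerHTML would return it for a parent element.
const markup = '<p id="main-title" style="color:red">Hello world!</p>';

console.log(removePunctuation(markup));
// → <p id "main title" style "color red">Hello world </p>
```

The `=`, `-` and `:` inside the attributes are punctuation as far as the regex is concerned, so the tag is rewritten along with the text.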

I tried, unsuccessfully, to save the position of each tag, take it out, and add it back in once the punctuation is removed. It always drops the attributes or misses some tags.

I also tried to find a regex that would exclude HTML tags from the removal of punctuation but looking at regex as a whole, I'm not sure if that's even possible!

CodePudding user response：

You need to find the text nodes in the document and update their nodeValue, rather than rewriting innerHTML.

Using some code from "How to get the text node of an element?" you can extract all the text nodes and then apply your function only to those:

const deepNonEmptyTextNodes = el => [...el.childNodes].flatMap(e =>
  e.nodeType === Node.TEXT_NODE && e.textContent.trim() ?
  e : deepNonEmptyTextNodes(e)
);

function removePunctuation(text) {
  return text.replace(/[\.#!£$%\^&\*;:{}=\-_`~()@\+\?\[\]]/g, ' ').replace(/ {2,}/g, ' ');
}

let textTags = [...document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, caption, span, a, div')];

textTags.forEach(tagNode => {
  const textNodes = deepNonEmptyTextNodes(tagNode);
  textNodes.forEach(node => node.nodeValue = removePunctuation(node.nodeValue))
})

<p>clean up the following !£$%^& chars</p>
<strong>stay as $%^&* you are</strong>
<div>clean these !£$%^&[]@ also</div>

Keep in mind, though, that this code also modifies text nodes inside tags you did not select if they are nested inside tags you did select. So a strong tag inside a div will be cleaned up too.
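
If that nesting behaviour is not wanted, one option is to walk the tree from a single root and stop descending at tags that should be left alone. The following is a rough sketch, not part of the linked answer: collectTextNodes and SKIPPED_TAGS are made-up names, and the skip list is only an example.

```javascript
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers

// Hypothetical list of tags whose text should stay untouched.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE', 'TEXTAREA', 'STRONG']);

// Returns every non-empty text node under `root`, without descending
// into elements whose tag name is in `skipped`.
function collectTextNodes(root, skipped = SKIPPED_TAGS) {
  const found = [];
  for (const child of root.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      if (child.nodeValue.trim()) found.push(child);
    } else if (
      child.nodeType === ELEMENT_NODE &&
      !skipped.has(child.tagName.toUpperCase())
    ) {
      found.push(...collectTextNodes(child, skipped));
    }
  }
  return found;
}
```

On a page this could be called once as collectTextNodes(document.body), with each returned node's nodeValue passed through removePunctuation. Starting from one root also avoids visiting the same text node several times when the selected elements are nested inside each other.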

In the same answer I linked to, you will find other methods that return only the immediate child text nodes of an element, if that is what you want.
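
As a rough idea of what such a method might look like (this is a sketch, not the code from the linked answer, and ownTextNodes is a made-up name):

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Returns only the non-empty text nodes that are direct children of `el`;
// text inside nested elements is left out.
function ownTextNodes(el) {
  return Array.from(el.childNodes).filter(
    node => node.nodeType === TEXT_NODE && node.nodeValue.trim()
  );
}
```

Used in place of deepNonEmptyTextNodes, each selected element would then only clean the text it contains directly, so a strong inside a div is touched only if strong is itself in the selector list.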

CodePudding user response：

Can't you just target the regular text tags?

For example, only look inside <p>, <h1>, <h2> …, <span>, <em>, <strong>, <textarea>, etc.