I have encountered a simple yet peculiar problem while calculating the word count of a string that contains HTML. The simple approach is to strip the HTML first and then count the whitespace-separated words. The problem I've found is that once the HTML tags are stripped away, some words are incorrectly concatenated.
The example below illustrates the issue, using JavaScript's "textContent" to strip the HTML.
<p>One</p><p>Two</p><p>Three</p>
becomes OneTwoThree
and is counted as a single word.
How would you go about counting words (simply)?
var text = document.getElementById("test").textContent;
var words = (text.match(/\S+/g) || []).length;
document.getElementById("words").textContent = words;
<div id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>
<div><span id="words">???</span> word(s)</div>
CodePudding user response:
Maybe this could work for you:
- Replace all tags with spaces, so <p>One</p><p>Two</p> would become " One  Two ".
- Collapse each run of whitespace into a single space, so the string has at most one extra space on the left and on the right.
- Remove that extra space.
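let html = "your html";
// Replace every tag with a space.
let tmp = html.replace(/<[^>]+>/g, " ");
// Collapse each run of whitespace into a single space.
tmp = tmp.replace(/\s+/g, " ");
// Remove the leading and trailing space.
tmp = tmp.replace(/^\s+|\s+$/g, "");
console.log(tmp);
// The words are now separated by exactly one space each.
let count = tmp === "" ? 0 : tmp.split(" ").length;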
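The three steps above can also be wrapped into a small reusable helper. This is only a minimal sketch: the name `countWordsInHtml` is hypothetical, and the regex-based tag stripping assumes reasonably simple markup (no `>` inside attribute values, no script or style content, no HTML entities to decode).

```javascript
// Hypothetical helper: counts whitespace-separated words in an HTML string.
function countWordsInHtml(html) {
  const text = html
    .replace(/<[^>]*>/g, " ") // step 1: every tag becomes a space
    .replace(/\s+/g, " ")     // step 2: collapse whitespace runs into one space
    .trim();                  // step 3: drop the leading/trailing space
  // An empty string would otherwise split into [""] and count as one word.
  return text === "" ? 0 : text.split(" ").length;
}

console.log(countWordsInHtml("<p>One</p><p>Two</p><p>Three</p>"));
console.log(countWordsInHtml("<div></div>"));
```

Note that because every tag becomes a space, inline markup inside a word (for example `un<b>break</b>able`) is split into several words with this approach.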
CodePudding user response:
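You need to use innerText instead. Unlike textContent, it follows the rendered layout and inserts line breaks between block-level elements such as <p>, so the words stay separated by whitespace.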
// textContent: the paragraphs run together with no whitespace between them.
var textWithoutWhiteSpaces = document.getElementById("test").textContent;
var wordsWithoutWhiteSpaces = (textWithoutWhiteSpaces.match(/\S+/g) || []).length;
// innerText: line breaks are inserted between the paragraphs.
var textWithWhiteSpaces = document.getElementById("test").innerText;
var wordsWithWhiteSpaces = (textWithWhiteSpaces.match(/\S+/g) || []).length;
console.log(wordsWithoutWhiteSpaces);
console.log(wordsWithWhiteSpaces);
document.getElementById("words").textContent = wordsWithWhiteSpaces;
<div id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>
<div><span id="words">???</span> word(s)</div>
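innerText needs a rendered DOM, which is not always available (for example when the HTML is processed as a plain string on a server). Its behavior of separating block-level elements but not inline ones can be roughly approximated with string replacement. The sketch below is an approximation under stated assumptions, not what browsers actually do: the helper name `countWordsBlockAware` and the short `BLOCK_TAGS` list are hypothetical and deliberately incomplete.

```javascript
// Hypothetical, incomplete list of tag names treated as block-level.
const BLOCK_TAGS = "p|div|br|li|ul|ol|h[1-6]|tr|td|th|table|section|article|blockquote";
const BLOCK_RE = new RegExp("<\\/?(?:" + BLOCK_TAGS + ")\\b[^>]*>", "gi");

function countWordsBlockAware(html) {
  const text = html
    .replace(BLOCK_RE, " ")   // block-level tags separate words
    .replace(/<[^>]*>/g, ""); // remaining (inline) tags do not
  const words = text.match(/\S+/g);
  return words ? words.length : 0; // match() returns null when nothing matches
}

console.log(countWordsBlockAware("<p>One</p><p>Two</p><p>Three</p>"));
console.log(countWordsBlockAware("<p>un<b>break</b>able word</p>"));
```

The design choice here is to treat only the listed tags as word boundaries, so inline formatting in the middle of a word does not inflate the count.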
CodePudding user response:
I will add another option to count the number of words among the tags:
const str = '<p><p><p>One<br></p><p>Two</p><p>Three</p><p></p></p><h1>four</h1><b>five</b><H1>123</H1>';
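const result = str
  .replace(/<[^>]*>/g, ' ') // replace every tag with a space
  .split(/\s+/)             // split on whitespace so multi-word text nodes count correctly
  .filter((el) => el !== '').length;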
console.log(result);