Calculating word count after stripping HTML

Time:08-04

I have run into a simple yet peculiar problem while counting the words in a string that contains HTML. The simple approach is to strip the HTML first and then count the whitespace-separated words. The problem is that once the tags are stripped, words from adjacent elements are concatenated.

See the example below, which illustrates the issue using JavaScript's textContent to strip the HTML.

<p>One</p><p>Two</p><p>Three</p> becomes OneTwoThree and is counted as a single word.

How would you go about counting words (simply)?

var text = document.getElementById("test").textContent;
var words = (text.match(/\S+/g) || []).length;
document.getElementById("words").textContent = words;
<div id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>

<div><span id="words">???</span> word(s)</div>
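
The snippet above needs a browser. As a rough DOM-free sketch of the same effect, assuming the tags are removed with a simple regular expression rather than by a real HTML parser (the helper name stripTags is made up for illustration):

```javascript
// Hypothetical helper: removes anything that looks like a tag and leaves
// nothing in its place, which roughly mimics how textContent joins the
// text of adjacent elements.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, "");
}

const html = "<p>One</p><p>Two</p><p>Three</p>";
const stripped = stripTags(html);
const naiveCount = (stripped.match(/\S+/g) || []).length;

console.log(stripped);   // OneTwoThree
console.log(naiveCount); // 1
```

Because nothing is left where the tags used to be, the three paragraphs collapse into one run of non-whitespace characters.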

CodePudding user response:

Maybe this could work for you:

  1. Replace all tags with spaces, so <p>One</p><p>Two</p> would become One Two .
  2. Collapse each run of whitespace into a single space, so the string is left with at most one extra space at each end.
  3. Remove those extra spaces, then count the words that remain.
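let html = "your html";
// 1. Replace every tag with a space.
let tmp = html.replace(/<[^>]+>/g, " ");
// 2. Collapse each run of whitespace into a single space.
tmp = tmp.replace(/\s+/g, " ");
// 3. Remove the leading and trailing space.
tmp = tmp.replace(/^\s+|\s+$/g, "");
console.log(tmp);

// The words are now separated by single spaces, so split on them and count.
let count = tmp === "" ? 0 : tmp.split(" ").length;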

CodePudding user response:

You need to use innerText instead of textContent. innerText takes the rendered layout into account, so the text of separate block elements such as <p> stays separated by whitespace.

var textWithoutWhiteSpaces = document.getElementById("test").textContent;
var wordsWithoutWhiteSpaces = (textWithoutWhiteSpaces.match(/\S+/g) || []).length;

var textWithWhiteSpaces = document.getElementById("test").innerText;
var wordsWithWhiteSpaces = (textWithWhiteSpaces.match(/\S+/g) || []).length;

console.log(wordsWithoutWhiteSpaces)
console.log(wordsWithWhiteSpaces)

document.getElementById("words").textContent = wordsWithWhiteSpaces;
<div id="test">
<p>One</p><p>Two</p><p>Three</p>
</div>

<div><span id="words">???</span> word(s)</div>
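
innerText only reflects layout for elements on a rendered page. Where no DOM is available, a similar effect can be approximated on the raw string by inserting a space only at block-level tags, so that inline tags in the middle of a word do not split it. This is only a sketch under assumptions: the BLOCK_TAGS list is a small hand-picked sample rather than the full set of block-level elements, the function name is invented here, and regular expressions are not a substitute for a real HTML parser.

```javascript
// Assumed, incomplete sample of tags that start a new line when rendered.
const BLOCK_TAGS = ["p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "td"];

// Matches an opening, closing, or self-closing tag from the list above.
const blockTagPattern = new RegExp("</?(?:" + BLOCK_TAGS.join("|") + ")\\b[^>]*>", "gi");

// Hypothetical helper name, not part of any library.
function countWordsBlockAware(html) {
  const text = html
    .replace(blockTagPattern, " ") // block boundaries separate words
    .replace(/<[^>]+>/g, "");      // remaining (inline) tags vanish without a gap
  return (text.match(/\S+/g) || []).length;
}

console.log(countWordsBlockAware("<p>One</p><p>Two</p><p>Three</p>")); // 3
console.log(countWordsBlockAware("<p>un<b>believ</b>able</p>"));       // 1
```

Replacing every tag with a space would count the second input as three words, which is why the two kinds of tag are handled separately here.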

CodePudding user response:

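Here is another option. It counts the non-empty pieces of text between the tags, so it matches the word count only when each piece holds a single word and there is no whitespace between the tags: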

const str = '<p><p><p>One<br></p><p>Two</p><p>Three</p><p></p></p><h1>four</h1><b>five</b><H1>123</H1>';

const result = str
    .replace(/(<.*?>)/g, '|')
    .split('|')
    .filter((el) => el !== '').length;

console.log(result);
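
Note that the snippet above counts text segments rather than words, so a segment such as "One Two" contributes 1. One possible variation, sketched here with an invented function name, splits on the tags directly and then counts the whitespace-separated words inside each segment:

```javascript
// Hypothetical helper: split the markup on tags, then count words per segment.
function countWordsBySegment(html) {
  return html
    .split(/<[^>]*>/)                                  // text between tags
    .flatMap((segment) => segment.match(/\S+/g) || []) // words in each segment
    .length;
}

const sample = '<p>One Two</p>\n<p>Three</p><h1>four</h1><b>five</b>';
console.log(countWordsBySegment(sample)); // 5
```

Segments that are empty or contain only whitespace, such as the newline between the two paragraphs, contribute no words.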
