How to find all strings on a page?

I almost managed to do what I want, but there is a small flaw.

I have this HTML

<body>
  <div>
    <div>div</div>
  </div>

  <h1>
    <h2>
      <p>p1</p>

      <p>
        <p>p2</p>
      </p>
    </h2>

    <h3>
      <h2>h2</h2>
      <h2>h2</h2>
    </h3>
  </h1>

  <span>span</span>
  <h6>
    <h6>h6</h6>
  </h6>
</body>

My latest attempt gives me almost the array I want:

var elements = Array.from(document.body.getElementsByTagName("*"));
var newStrings = [];

for (var i = 0; i < elements.length; i++) {
  const el = elements[i];
  if (el.innerText.length !== 0) {
    newStrings.push(el.innerText);
  }
}

console.log(newStrings); //  ['div', 'div', 'p1\n\np2', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']

But the result I need is ['div', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']

I will be very grateful for your help!

CodePudding user response：

The most reliable way to get every string on the page is to select all the text nodes and read the text content of each one. That avoids the duplicates you get when you read the innerText of both a parent and its child, because each piece of text belongs to exactly one text node.

Here is one way to select all the text nodes in a page (adapted from https://stackoverflow.com/a/10730777/19461620):

const textNodes = [];
// Walk only the text nodes under <body>, in document order.
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
let n;
while ((n = walker.nextNode())) textNodes.push(n);
// Drop whitespace-only nodes (indentation and line breaks between tags).
const newStrings = textNodes.map(textNode => textNode.textContent).filter(text => text.trim() !== '');
console.log(newStrings); // outputs: ['div', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']
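For comparison, the same idea can be sketched as a plain recursive walk over childNodes. This is only an illustration, not part of the linked answer: collectStrings, text, el and fakeBody are hypothetical names, and the small stand-in tree exists only so the snippet also runs outside a browser. In a page you would presumably call collectStrings(document.body) instead. Unlike the snippet above, this version also trims each string before keeping it.

```javascript
// Hypothetical helper (not from the answer above): collects the non-blank
// text of every text node under `root`, in document order.
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

function collectStrings(root, out = []) {
  for (const child of root.childNodes || []) {
    if (child.nodeType === TEXT_NODE) {
      const trimmed = child.textContent.trim();
      if (trimmed !== '') out.push(trimmed);
    } else {
      // Elements (and anything else with children) are searched recursively.
      collectStrings(child, out);
    }
  }
  return out;
}

// Tiny stand-in tree (an assumption for this demo, not a real DOM).
const text = (s) => ({ nodeType: TEXT_NODE, textContent: s });
const el = (...children) => ({ nodeType: 1, childNodes: children });

const fakeBody = el(
  text('\n  '),                    // whitespace between tags is skipped
  el(el(text('div'))),
  el(text('p1'), el(text('p2'))),
  el(text('span'))
);

console.log(collectStrings(fakeBody)); // ['div', 'p1', 'p2', 'span']
```

Both this walk and the TreeWalker version read text nodes directly, so they will also pick up text inside elements that are not rendered, such as an inline script or style tag inside the body.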

CodePudding user response：

Try this; it should give you the desired output:

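function getInnerText() {
  const elements = document.querySelectorAll("*");
  const innerTexts = [];

  for (const element of elements) {
    const innerText = element.innerText;

    // Keep only elements that have some visible, non-whitespace text.
    if (innerText && innerText.trim().length > 0) {
      innerTexts.push(innerText);
    }
  }

  // The first match is the root <html> element, whose innerText already
  // holds all the visible text of the page, so only that entry is split.
  return innerTexts[0].split('\n').filter(function (el) {
    return el !== "";
  });
}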

const innerTexts = getInnerText();

console.log(innerTexts);
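Since only the first collected entry is actually used, the core of this approach is splitting one block of rendered text into lines. A minimal sketch of just that step is below. splitIntoLines and sample are hypothetical names, and the sample string merely stands in for what document.body.innerText might return for the HTML in the question.

```javascript
// Hypothetical helper: turns rendered text (for example the value of
// document.body.innerText) into a list of non-blank, trimmed lines.
function splitIntoLines(renderedText) {
  return renderedText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '');
}

// Sample input (an assumption standing in for document.body.innerText).
const sample = 'div\np1\n\np2\nh2\nh2\nspan\nh6';

console.log(splitIntoLines(sample)); // ['div', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']
```

One caveat with splitting on line breaks: an element whose own text spans several rendered lines (for instance because it contains a br tag) ends up as several strings, whereas the text-node approach keeps each node's text in one piece.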