Puppeteer find list of shadowed elements and get list of ElementHandles

I'm running Node 12 along with Puppeteer 2.2.1 (both can't be upgraded right now).

The challenge is to find DOM elements inside shadow roots and pass them to another function of my main class. Additionally, I'm scraping different websites, so the code has to find shadow roots dynamically. Using page.$() or page.$$(), I can't check whether an element has a shadowRoot; using page.$$eval() or page.evaluate(), it seems I can't pass element handles back to the main function (at least I don't know how to do this).

For example, with this code I'm able to traverse all elements and find those matching my criteria, but evaluate() returns only serialized values such as text, and evaluateHandle() returns a single JSHandle, which is not iterable.

#!/usr/bin/env node

'use strict';

const puppeteer = require('puppeteer');

class Generator {
    async main() {
        const url = 'https://www.example.com';
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url, {waitUntil: 'networkidle2'});
        await this.scrape(page);
    }

    async process(page, element) {
        await page.evaluate(element => {
            element.click();
        }, element);
        this.doOtherThingsWithElement(element);
    }

    async scrape(page) {
        const elements = await page.evaluateHandle(() => {
            const walk = root => [
            ...[...root.querySelectorAll('[class*=shadowed-child] button')],
            ...[...root.querySelectorAll('*')]
                .filter(e => e.shadowRoot)
                .flatMap(e => walk(e.shadowRoot))
            ];
            return walk(document);
        });
        // THIS does not work
        for (const element of elements) {
            await this.process(page, element);
        }
    }
}

How can I access the DOM elements in for loop?

CodePudding user response:

How can I access the DOM elements in for loop?

You can't do it directly, as of the time of writing. A JSHandle is not iterable and has no length or indexing capabilities. As this answer shows, the usual way to work with this JSHandle is to run it through another evaluate, at which point the array is accessible:

const puppeteer = require("puppeteer"); // ^19.1.0

const html = `<!DOCTYPE html><html><body>
  <h1>foo</h1>
  <h1>bar</h1>
  <h1>baz</h1>
</body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  const handle = await page.evaluateHandle(() =>
    document.querySelectorAll("h1")
  );
  const text = await handle.evaluate(els =>
    [...els].map(e => e.textContent)
  );
  console.log(text); // => [ 'foo', 'bar', 'baz' ]
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

It's unimportant that the above example doesn't use shadow roots; the behavior is the same either way:

const html = `<!DOCTYPE html><html><body>
  <div></div>
<script>
const el = document.querySelector("div");
const root = el.attachShadow({mode: "open"});
el.shadowRoot.innerHTML = \`
  <h1>foo</h1>
  <h1>bar</h1>
  <h1>baz</h1>
\`;
</script>
</body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  const handle = await page.evaluateHandle(() =>
    document
      .querySelector("div")
      .shadowRoot
      .querySelectorAll("h1")
  );
  const text = await handle.evaluate(els =>
    [...els].map(e => e.textContent)
  );
  console.log(text); // => [ 'foo', 'bar', 'baz' ]
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

I'll just use the non-shadow root boilerplate for the rest of this post for simplicity, keeping in mind page.$$ is not an option, which is the primary relevance of the shadow roots to this particular problem.

Now, there are some additional workarounds I can offer. These are generic since I'm not sure what your actual use case is, so hopefully you can figure it out from here. One approach is to store the selected elements in a global variable on window, then return the length and use that in a loop:

// ... same boilerplate
const length = await page.evaluate(() => {
  window.nodes = [...document.querySelectorAll("h1")];
  return window.nodes.length;
});

for (let i = 0; i < length; i++) {
  const text = await page.evaluate(i =>
    window.nodes[i].textContent, i
  );
  console.log(text);
}
// ...

A more naive approach is to just redo the traversal and selection every time you need to evaluate. This avoids storing the array on window, where it can go stale after a mutation or clash with an existing global.
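As a rough sketch of that naive approach, assuming `page` is a Puppeteer Page: `textAt` is a made-up helper name, and the plain `querySelectorAll` stands in for whatever traversal you actually need. The selection is repeated inside every evaluate call, so nothing is kept on window between calls.

```javascript
// Hypothetical helper: re-select inside every evaluate call instead of
// caching the node list on window.
const textAt = (page, selector, index) =>
  page.evaluate(
    // Runs in the browser; selector and index arrive as serialized arguments.
    (selector, index) =>
      document.querySelectorAll(selector)[index].textContent,
    selector,
    index
  );

// Possible usage inside an async function:
//   console.log(await textAt(page, "h1", 1));
```

The cost is one full selection per call, which is usually fine for small pages but adds up when the list is long.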

Taking the approach a step further, you can make an array that works sort of like page.$$'s ElementHandle[] return value enough for your intended use case:

// ...
const length = await page.evaluate(() => {
  window.nodes = [...document.querySelectorAll("h1")];
  return window.nodes.length;
});
const nodes = await Promise.all([...Array(length)].map((_, i) => 
  page.evaluateHandle(i => window.nodes[i], i)
));

// make up a silly process function
const process = element =>
  // element.click() will now work, but for simplicity:
  element.evaluate(el => el.textContent);

// now go and do the thing you wanted to do originally
for (const node of nodes) {
  console.log(await process(node));
}
// ...

Note that window.nodes is just an example name and probably not the best choice if you want to rule out clashes with existing globals. And of course, replace window.nodes = [...document.querySelectorAll("h1")]; with your shadow root traversal.
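Here is a hedged sketch of how the traversal from the question might be plugged into this pattern. The names `collectShadowed`, `shadowHandles` and `__shadowNodes` are my own inventions, and `page` is assumed to be a Puppeteer Page. Only open shadow roots are reachable this way, because `shadowRoot` is null for closed ones.

```javascript
// Browser-side function. Puppeteer serializes it, so it must be
// self-contained. `__shadowNodes` is a made-up global name.
function collectShadowed(selector) {
  const walk = root => [
    ...root.querySelectorAll(selector),
    ...[...root.querySelectorAll("*")]
      .filter(el => el.shadowRoot) // open shadow roots only
      .flatMap(el => walk(el.shadowRoot)),
  ];
  window.__shadowNodes = walk(document);
  return window.__shadowNodes.length;
}

// Node-side helper (hypothetical): fetch one handle per stored element,
// serially, and return them as a plain array.
async function shadowHandles(page, selector) {
  const length = await page.evaluate(collectShadowed, selector);
  const handles = [];
  for (let i = 0; i < length; i++) {
    handles.push(
      await page.evaluateHandle(i => window.__shadowNodes[i], i)
    );
  }
  return handles;
}

// Possible usage inside scrape(page):
//   const selector = "[class*=shadowed-child] button";
//   for (const element of await shadowHandles(page, selector)) {
//     await this.process(page, element);
//   }
```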

Also, if length is large, const nodes = await Promise.all might be too much parallelism, in which case you can use a traditional imperative serial loop:

// ...
const nodes = [];
for (let i = 0; i < length; i++) {
  nodes.push(await page.evaluateHandle(i => window.nodes[i], i));
}

for (const node of nodes) { /* ... */ }
// ...
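Another route that may be worth trying, instead of indexing into a global, is to unpack an array handle with JSHandle.getProperties() and keep the entries for which asElement() returns an ElementHandle. This is only a sketch: `toElementHandles` is a hypothetical helper, and I sort the keys explicitly rather than rely on the order of the returned Map.

```javascript
// Hypothetical helper: turn a JSHandle that wraps an array of elements
// into a plain array of ElementHandles.
async function toElementHandles(arrayHandle) {
  // getProperties() resolves to a Map of property name -> JSHandle.
  const properties = await arrayHandle.getProperties();
  return [...properties.entries()]
    .filter(([key]) => /^\d+$/.test(key)) // keep array indices only
    .sort(([a], [b]) => Number(a) - Number(b)) // restore array order
    .map(([, handle]) => handle.asElement()) // null for non-elements
    .filter(Boolean);
}

// Possible usage inside an async function:
//   const handle = await page.evaluateHandle(() =>
//     [...document.querySelectorAll("h1")]
//   );
//   for (const element of await toElementHandles(handle)) {
//     console.log(await element.evaluate(el => el.textContent));
//   }
```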

For some use cases, the :nth-child CSS pseudoselector can work for indexing as well.

Depending on your use case, page.exposeFunction may help too, letting you trigger Node code from the browser context depending on each element you're processing.
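A minimal sketch of the page.exposeFunction idea, assuming `page` is a Puppeteer Page: `streamElements`, `reportElement` and the shape of the reported object are all made up for illustration. Only serializable data can cross into Node this way, not the element itself.

```javascript
// Hypothetical helper: call a Node-side callback once per matching element.
async function streamElements(page, selector, onElement) {
  // Registers window.reportElement in the page. exposeFunction throws if
  // the name is already taken, so call this once per page.
  await page.exposeFunction("reportElement", onElement);
  return page.evaluate(async selector => {
    const elements = [...document.querySelectorAll(selector)];
    for (const [index, el] of elements.entries()) {
      // The exposed function returns a promise in the browser context.
      await window.reportElement({ index, text: el.textContent });
    }
    return elements.length;
  }, selector);
}

// Possible usage inside an async function:
//   await streamElements(page, "h1", item => console.log(item));
```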

Additionally, you may be able to use native DOM functions for clicking and doing whatever else you need to do, avoiding the need to make an array of handles in Node context and keeping all of your activity inside evaluate(s).
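For instance, if clicking is all you need, something along these lines might be enough. `clickShadowed` is a hypothetical name and the selector is the one from the question. Keep in mind that the DOM's el.click() just dispatches a click event, unlike Puppeteer's ElementHandle.click(), which scrolls the element into view and clicks it with the mouse.

```javascript
// Browser-side function, meant to be passed to page.evaluate(). It walks
// the document and all open shadow roots, clicks every match, and returns
// how many elements were clicked.
function clickShadowed(selector) {
  const walk = root => [
    ...root.querySelectorAll(selector),
    ...[...root.querySelectorAll("*")]
      .filter(el => el.shadowRoot)
      .flatMap(el => walk(el.shadowRoot)),
  ];
  const targets = walk(document);
  for (const el of targets) {
    el.click();
  }
  return targets.length;
}

// Possible usage inside an async function:
//   const clicked = await page.evaluate(
//     clickShadowed,
//     "[class*=shadowed-child] button"
//   );
```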