Need to select very specific element using querySelector without returning undefined

Time:11-06

I'm scraping a site with Puppeteer and need to get one very specific piece of data from it. I'm trying to use querySelector to select the element by its class name, but that has proven difficult because 22 elements share the exact same class name (FormData). The one I need is the 18th of the 22. I've been trying to select it and print it out, but to no avail: I always get the same error, or something along those lines.

Code

// MODULES
const puppeteer = require("puppeteer");

// Url where we get and scrape the data from
const URL = "https://www.sec.gov/edgar/search/#/category=form-cat2";

(async () => {
    try {
        const chromeBrowser = await puppeteer.launch({ headless: true });
        const page = await chromeBrowser.newPage();
        await page.goto(URL, {timeout: 0});

    const getInfo = await page.evaluate(() => {
        const secTableEN = document.querySelector(".table td.entity-name");
        const secTableFiled = document.querySelector(".table td.filed");
        const secTableLinkPrice = document.querySelector('.FormData')[17];

        return {
            secTableEN: secTableEN.innerText,
            secTableFiled: secTableFiled.innerText,
            secTableLinkPrice: secTableLinkPrice.innerText,
        };
    });

    console.log(
        "Name: " + getInfo.secTableEN, '\n' +
        "Amount Purchased: " + getInfo.secTableLinkPrice, '\n'
    );

    await page.close();
    await chromeBrowser.close();
    } catch (e) {
        console.error(e)
    }
})();

The error I always get is: Error: Evaluation failed: TypeError: Cannot read properties of undefined (reading 'innerText'). It only happens when I return secTableLinkPrice.innerText; the other two work fine on their own. What can I do?

CodePudding user response:

Apparently the price you want from the top result is in a popup, so you need to click on one of the .preview-file links to make that popup appear. Only then can you select .FormData from the iframe modal.

const puppeteer = require("puppeteer"); // ^19.1.0

const url = "<YOUR URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const $ = (...args) => page.waitForSelector(...args);
  await (await $(".filetype .preview-file")).click();
  const frame = await (await $("#ipreviewer")).contentFrame();
  await frame.waitForSelector(".FormText");
  const price = await frame.$$eval(".FormText", els =>
    els.find(e => e.textContent.trim() === "$")
      .parentNode
      .textContent
      .trim()
  );
  console.log(price);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Now, the popup triggers a network request to an XML file (which appears to be HTML), so it might be easiest to just download that, since it probably has all of the data you want. In the code below, I'm actually parsing and traversing that HTML with Puppeteer, so it looks like more work, but perhaps you could just save this file to disk, depending on your needs:

// ... same as above ...

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const responseP = page.waitForResponse(res =>
    res.status() === 200 && res.url().endsWith(".xml")
  );
  const a = await page.waitForSelector(".filetype .preview-file");
  await a.click();
  const html = await (await responseP).text();
  await page.evaluate(html => document.body.outerHTML = html, html);
  const price = await page.$$eval(".FormText", els =>
    els.find(e => e.textContent.trim() === "$")
      .parentNode
      .textContent
      .trim()
  );
  console.log(price);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
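
If saving the response to disk is all you need, a minimal sketch could look like the following. The helper name saveBody and the fallback file name "filing.xml" are my own assumptions, not part of Puppeteer; it only relies on Node's built-in fs and path modules.

```javascript
const fs = require("fs/promises");
const path = require("path");

// Hypothetical helper: derive a file name from the response URL
// and write the response body to disk. Returns the path written.
async function saveBody(url, body, dir = ".") {
  // Use the last path segment of the URL, or a fallback if it is empty.
  const name = path.basename(new URL(url).pathname) || "filing.xml";
  const target = path.join(dir, name);
  await fs.writeFile(target, body, "utf8");
  return target;
}
```

In the script above, this would be called with something like saveBody(res.url(), await res.text()) after awaiting responseP into a variable named res (a name assumed here), instead of injecting the HTML into the page.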

Finally, some documents don't have a price, so the above code only works on form type "4 (Insider trading report)". Furthermore, I haven't verified that all of these type 4 reports have exactly the same structure. You'll probably want to handle this in your code and proceed carefully.
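
One way to handle documents without a price is to stop the callback from throwing when no "$" cell is found. This is only a sketch: extractPrice is a hypothetical name, and it assumes the same ".FormText" layout as the code above.

```javascript
// Hypothetical replacement for the inline $$eval callback:
// returns the trimmed price text, or null when no "$" cell exists.
function extractPrice(els) {
  const dollar = els.find(e => e.textContent.trim() === "$");
  if (!dollar || !dollar.parentNode) {
    return null; // no price in this document
  }
  return dollar.parentNode.textContent.trim();
}
```

It should be possible to pass it straight to the existing call, as in frame.$$eval(".FormText", extractPrice), and then skip or log the document when the result is null.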

CodePudding user response:

Put the property access inside an if statement. This tests that the node exists first, so you will not get the "Cannot read properties of undefined" error.
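
A minimal sketch of such a guard follows. Note that in the question's code, document.querySelector returns a single element (or null), so indexing it with [17] always gives undefined; querySelectorAll returns a list that can be indexed. The helper name nthText is hypothetical.

```javascript
// Hypothetical helper: text of the index-th element matching selector,
// or null if there is no such element.
function nthText(root, selector, index) {
  const matches = root.querySelectorAll(selector); // list of all matches
  const node = matches[index];
  if (!node) {
    return null; // fewer matches than expected, or none at all
  }
  return node.innerText;
}
```

Inside page.evaluate the body would need to be inlined or defined within the callback, since functions declared in Node are not visible in the browser context, for example nthText(document, ".FormData", 17). If the .FormData elements only exist inside the preview iframe, as the first answer suggests, this will return null on the main page rather than throw.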
