I am trying to use Puppeteer to extract the innerHTML of a button on a webpage. For now, I am simply trying to wait for the selector to appear so that I can then work with it.
When I run the code below, the program times out waiting for the selector.
const puppeteer = require("puppeteer");
const link =
  "https://etherscan.io/tx/0xb06c7d09611cb234bfcd8ccf5bcd7f54c062bee9ca5d262cc5d8f3c4c923bd32";
async function configureBrowser() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(link);
  return page;
}
async function findFee(page) {
  await page.reload({ waitUntil: ["networkidle0", "domcontentloaded"] });
  await page.waitForSelector("#txfeebutton");
  console.log("boom");
}
const setup = async () => {
  const page = await configureBrowser();
  await findFee(page);
  await browser.close();
};
setup();
The element definitely exists in the HTML.
CodePudding user response:
It works fine with a user agent string:
const puppeteer = require("puppeteer"); // ^14.3.0
let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setExtraHTTPHeaders({"Accept-Language": "en-US,en;q=0.9"});
  await page.setUserAgent(ua);
  const url = "https://etherscan.io/tx/0xb06c7d09611cb234bfcd8ccf5bcd7f54c062bee9ca5d262cc5d8f3c4c923bd32";
  await page.goto(url);
  const btn = await page.waitForSelector("#txfeebutton");
  console.log(await btn.evaluate(el => el.textContent.trim())); // => ($0.56)
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
One debugging strategy for this is to try the same script with headless: false and see whether that works, then check page.content() when running headlessly. Doing so shows that Cloudflare is detecting your scraper and presenting a captcha.
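That debugging step could be sketched roughly as follows. This is only an illustration: inspectPage and looksLikeChallenge are hypothetical helper names, and the marker strings are guesses at what a challenge page might contain, not an official Cloudflare list.

```javascript
// Marker strings are illustrative assumptions, not an official list.
const CHALLENGE_MARKERS = ["just a moment", "captcha", "cf-challenge"];

// Returns true if the HTML contains any of the assumed challenge markers.
function looksLikeChallenge(html) {
  const lower = html.toLowerCase();
  return CHALLENGE_MARKERS.some(marker => lower.includes(marker));
}

// Loads `url` with the given headless setting and reports what was served.
async function inspectPage(url, headless) {
  // Required lazily so looksLikeChallenge can be used without Puppeteer.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({headless});
  try {
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const html = await page.content();
    console.log(`headless=${headless} title=${await page.title()}`);
    console.log(`challenge suspected: ${looksLikeChallenge(html)}`);
    return html;
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

Calling inspectPage(url, true) and then inspectPage(url, false) and comparing the two outputs should show whether only the headless run is being served the challenge page.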
Related:
- Puppeteer can't find elements when Headless TRUE
- Why does headless need to be false for Puppeteer to work?
As an aside, configureBrowser leaks a reference to the browser object: browser is local to that function, so the browser.close() call in setup refers to an undefined variable and you'll never be able to gracefully terminate the process. I recommend the above boilerplate and avoiding premature abstractions.
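If you do want to keep a helper, one possible shape is a wrapper that owns the browser for the duration of the work and always closes it. This is a minimal sketch, not part of Puppeteer: withPage is a hypothetical name, and launch is assumed to be any function that resolves to a Puppeteer-style browser object.

```javascript
// `withPage` is a hypothetical helper. `launch` is any function that
// resolves to a Puppeteer-style browser, e.g. () => puppeteer.launch().
async function withPage(launch, work) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    return await work(page);
  } finally {
    // Runs on success and on error, so the browser reference is never lost.
    await browser.close();
  }
}

// Usage sketch (requires puppeteer to be installed):
// const puppeteer = require("puppeteer");
// withPage(() => puppeteer.launch(), async page => {
//   await page.goto(url);
//   const btn = await page.waitForSelector("#txfeebutton");
//   console.log(await btn.evaluate(el => el.textContent.trim()));
// }).catch(err => console.error(err));
```

Because the browser never leaves withPage, there is nothing for the caller to forget to close.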