I am using Puppeteer to crawl some data and need to visit many pages in a relatively short time. I noticed that this is pretty inefficient: I am only interested in the data in the HTML markup, but loading the whole page with all its images, fonts and whatnot is slow. Is there a way to skip the other content types and make Puppeteer load only the HTML content? Here is my code:
import fs from "fs";
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
const helperFile = fs.readFileSync("dist/app/scripts/helpers.js", "utf8");
await page.evaluateOnNewDocument(helperFile);
await login(page);
await postLogin(page);
await crawl(page); // this function is gonna call page.goto(...) a lot
await browser.close();
Answer:
You can intercept every request Puppeteer makes, let only the ones you need continue(), and abort the rest.
Besides the document type, I also included the script type, because JS code may modify the initial DOM tree (with something like appendChild(node)). This is especially true if you're crawling an SPA built with a modern framework or library like React, where the server returns only a couple of JS bundles that generate the HTML on the client. The xhr and fetch types are there in case the JS code makes additional requests to the server to get more data and update the DOM tree.
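import puppeteer, { Page } from "puppeteer";
const htmlOnly = async (page: Page) => {
  await page.setRequestInterception(true); // enable request interception
  // The "request" string event name is used here instead of PageEmittedEvents.Request,
  // since that enum is not exported by every Puppeteer version.
  page.on("request", (req) => {
    // abort everything except the markup and the requests that can change it
    if (!["document", "xhr", "fetch", "script"].includes(req.resourceType())) {
      return req.abort();
    }
    req.continue();
  });
};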
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await htmlOnly(page);
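If the pages you crawl are fully server-rendered, you may not need script, xhr or fetch at all. Here is a minimal sketch of a configurable version of the same filter. Note that makeResourceFilter, InterceptableRequest, renderedHtmlFilter and rawHtmlFilter are names I made up for illustration, not part of Puppeteer; the interface only lists the three request methods the filter relies on.

```typescript
// Minimal structural type: only the methods the filter needs.
// Puppeteer's HTTPRequest provides these, so it can be passed in directly.
interface InterceptableRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical helper (not a Puppeteer API): builds a request handler that
// lets through only the resource types listed in `allowed` and aborts the rest.
const makeResourceFilter = (allowed: readonly string[]) => {
  const allowedTypes = new Set(allowed);
  return (req: InterceptableRequest): Promise<void> =>
    allowedTypes.has(req.resourceType()) ? req.continue() : req.abort();
};

// Same allow-list as in the answer above: markup plus anything that can change it.
const renderedHtmlFilter = makeResourceFilter(["document", "xhr", "fetch", "script"]);

// Stricter variant: only the markup itself, assuming no client-side rendering.
const rawHtmlFilter = makeResourceFilter(["document"]);
```

To use it, call await page.setRequestInterception(true) as before and then page.on("request", renderedHtmlFilter) (or rawHtmlFilter). Whether the stricter variant still gives you the data you need depends on the site, so check a few pages before crawling with it.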