I'm having a problem with Puppeteer: I want to extract a list of items, which works when headless is false but not when it is true.
First things first, I want to get those elements before mapping over them.
Here's my script. It is really basic, so you should be able to reproduce the issue.
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
(async () => {
const browser = await puppeteer.launch({
headless: false,
ignoreHTTPSErrors: true,
args: [`--window-size=1920,1080`],
defaultViewport: {
width: 1920,
height: 1080,
},
});
const page = await browser.newPage();
// Begin navigation
console.log(chalk.yellow("Beginning navigation."));
await page.goto(searchUrl);
// Await List of elements;
console.log(chalk.yellow("Wait for Network Idle..."));
await page.waitForNetworkIdle();
// get Items
const findElements = await page.evaluate(() => {
const elements = document.querySelectorAll(".sale-item");
console.log(elements);
return elements;
});
console.log(findElements);
console.log(chalk.blue("Waiting..."));
await page.waitForTimeout(10000);
await browser.close();
console.log(chalk.red("Closed."));
})();
Expected results : {
'0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
'4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
.
.
}
CodePudding user response:
For starters, I'd prefer page.waitForSelector(yourSelector) over page.waitForNetworkIdle(). In most cases, it's a more direct guarantee that the data you want is on the page, whereas network idle can block waiting for all sorts of requests that are totally irrelevant to the data you're trying to scrape.
Some websites check the headers to block scrapers. You can try adding a user agent header as described in the Puppeteer GitHub issue Different behavior between { headless: false } and { headless: true } #665:
const puppeteer = require("puppeteer");
const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
const searchTerm = "Apple";
const searchUrl = baseUrl + searchTerm;
let browser;
(async () => {
browser = await puppeteer.launch({headless: true});
const [page] = await browser.pages();
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(searchUrl);
await page.waitForSelector(".sale-item");
const elements = await page.$$(".sale-item");
console.log(elements.length); // => 48
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
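Since the goal is to map over the items afterwards, here is a minimal sketch of one way to do that. As far as I know, page.evaluate can only hand back serializable values, so DOM nodes do not come back from the browser intact. page.$$eval runs the mapping inside the page and returns plain data instead. The helper name extractItems and the className/text fields are my own illustrative choices, not something from the question.

```javascript
// Sketch: wait for the items, then map them to plain data inside the page.
// `page` is expected to be a Puppeteer Page; `extractItems` and the returned
// fields (className, text) are hypothetical names chosen for illustration.
async function extractItems(page, selector = ".sale-item") {
  // Resolves once at least one matching element is in the DOM.
  await page.waitForSelector(selector);
  // The callback runs in the browser; only serializable values come back.
  return page.$$eval(selector, (nodes) =>
    nodes.map((node) => ({
      className: node.className,
      text: node.textContent.trim(),
    }))
  );
}
```

In the script above, this could be called as `const items = await extractItems(page);` right after page.goto(searchUrl).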
Using puppeteer-extra as described in Why does headless need to be false for Puppeteer to work? is another option you can try. It also anonymizes the user agent headers.
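A rough sketch of how that might look, assuming puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed. The countItems helper is a hypothetical name of mine. It takes the launcher as a parameter so the same code works with plain puppeteer or with puppeteer-extra. Whether the stealth plugin is enough for this particular site is something you'd have to try.

```javascript
// Sketch: count matching elements using whatever Puppeteer-compatible
// launcher is passed in (plain puppeteer, or puppeteer-extra with plugins).
// `countItems` is a hypothetical helper, not part of either library.
async function countItems(launcher, url, selector) {
  const browser = await launcher.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector(selector);
    const elements = await page.$$(selector);
    return elements.length;
  } finally {
    // Always release the browser, even if navigation or the wait fails.
    await browser.close();
  }
}

// Usage, assuming the three packages named above are installed:
//   const puppeteer = require("puppeteer-extra");
//   const StealthPlugin = require("puppeteer-extra-plugin-stealth");
//   puppeteer.use(StealthPlugin());
//   countItems(puppeteer, searchUrl, ".sale-item").then(console.log);
```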
CodePudding user response:
Why did you ignore HTTPS errors?
Are you sure you are allowed to scrape this website? If not, that could be a problem.
Try again without the ignoreHTTPSErrors option.