Puppeteer not retrieving JavaScript rendered page

Time:09-12

I am trying to load a product page using Puppeteer, but it's not working.

    const puppeteer = require('puppeteer')

    async function start() {
        const browser = await puppeteer.launch()
        const page = await browser.newPage()

        // 0 disables the navigation timeout
        await page.setDefaultNavigationTimeout(0);

        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36');

        const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010"
        await page.goto(url, {'waitUntil' : ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']})
        await page.screenshot({path: "screenshot3.png", fullPage: true})
        await browser.close();
    }

    start()

If we open this URL in a browser, only part of the page loads at first; the rest loads as we scroll down.

I tried scrolling as well, but it did not work.

The scroll function is as follows:

    const waitTillHTMLRendered = async (page, timeout = 30000) => {
      const checkDurationMsecs = 1000;
      const maxChecks = timeout / checkDurationMsecs;
      let lastHTMLSize = 0;
      let checkCounts = 1;
      let countStableSizeIterations = 0;
      const minStableSizeIterations = 3;

      while (checkCounts++ <= maxChecks) {
        let html = await page.content();
        let currentHTMLSize = html.length;

        let bodyHTMLSize = await page.evaluate(() => document.body.innerHTML.length);

        console.log('last: ', lastHTMLSize, ' <> curr: ', currentHTMLSize, " body html size: ", bodyHTMLSize);

        if (lastHTMLSize != 0 && currentHTMLSize == lastHTMLSize)
          countStableSizeIterations++;
        else
          countStableSizeIterations = 0; // reset the counter

        if (countStableSizeIterations >= minStableSizeIterations) {
          console.log("Page rendered fully..");
          break;
        }

        lastHTMLSize = currentHTMLSize;
        await page.waitForTimeout(checkDurationMsecs);
      }
    };

CodePudding user response:

When I run this headfully, the page never loads fully with the review content. The site seems to be detecting the bot and blocking those reviews from coming through regardless of the scroll.
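
For reference, a scroll pass of the kind the question describes could look like the sketch below. `autoScroll` and its `step`, `delayMs`, and `maxSteps` options are hypothetical names of my own, not part of Puppeteer; the only Puppeteer call used is `page.evaluate`. This can help on pages that lazy-load as you scroll, but it won't help if the site is withholding content from a detected bot.

```javascript
// Hypothetical helper: scroll down in fixed steps until the bottom of the
// document is reached or maxSteps is exhausted.
// Returns the number of scroll steps performed.
async function autoScroll(page, {step = 600, delayMs = 250, maxSteps = 100} = {}) {
  let steps = 0;
  while (steps < maxSteps) {
    steps++;
    // Runs in the browser context: scroll by `px` pixels and report
    // whether the viewport has reached the end of the document.
    const atBottom = await page.evaluate(px => {
      window.scrollBy(0, px);
      return window.innerHeight + window.scrollY >= document.body.scrollHeight;
    }, step);
    if (atBottom) break;
    // Give lazy-loaded content a moment to arrive and extend the page.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  return steps;
}
```

You would call it as `await autoScroll(page);` after `page.goto(...)` and before taking the screenshot.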

Using puppeteer-extra with the stealth plugin (puppeteer-extra-plugin-stealth) headfully avoids detection, but headless mode with stealth is still blocked. I'll update if I can find a solution, but I figure this is at least a step forward.

    const puppeteer = require("puppeteer-extra"); // ^3.2.3
    const StealthPlugin = require("puppeteer-extra-plugin-stealth"); // ^2.9.0
    puppeteer.use(StealthPlugin());

    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: false});
      const [page] = await browser.pages();
      const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010";
      await page.goto(url, {waitUntil: "domcontentloaded"});
      await page.waitForSelector(".sdp-review__article__list__review__content");
      await page.waitForNetworkIdle();
      await page.screenshot({path: "screenshot3.png", fullPage: true});
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());

In the future, if you see waitForSelector timeouts when running headlessly, it's a good idea to add a console.log(await page.content()); which will usually show that you've been blocked before you waste time messing with scrolling and other futile strategies.
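
A minimal sketch of that diagnostic, assuming you wrap the wait in a small helper: `waitForSelectorOrDump` is a hypothetical name of my own, not a Puppeteer API. It only relies on `page.waitForSelector` and `page.content`, and takes an optional `log` function so the output can be redirected.

```javascript
// Hypothetical helper: wait for a selector and, if the wait fails
// (typically with a TimeoutError), log the HTML that was actually served
// before rethrowing the original error.
async function waitForSelectorOrDump(page, selector, options = {}, log = console.log) {
  try {
    return await page.waitForSelector(selector, options);
  } catch (err) {
    // A block page or CAPTCHA usually shows up right here in the markup.
    log(await page.content());
    throw err;
  }
}
```

For example, `await waitForSelectorOrDump(page, ".sdp-review__article__list__review__content", {timeout: 10000});` would print the page HTML if the review content never appears within 10 seconds.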

See also: "Why does headless need to be false for Puppeteer to work?"
