Loading dynamic webpage with Puppeteer works on localhost but not Heroku

Time:09-28

I have a Node.js app built with Express and deployed on Heroku. The problem only affects dynamic webpages; static webpages load fine.

Loading dynamic webpages works on localhost, but on Heroku the request fails with code=H12, desc="Request timeout", service=30000ms, status=503.

In addition, right after running heroku restart or deploying, there always seems to be exactly one request that returns status=200 but loads only the static portion of a dynamic webpage.

Screenshot of logs here.


I've tried the following, which have all led to either the same or other unexpected results when deployed on Heroku (such as Error R14 (Memory quota exceeded) and code=H13 desc="Connection closed without response"):

  • Switching the Puppeteer Heroku buildpack I was using. I've tried the ones mentioned in this troubleshooting guide and this comment.
  • Adding headless: true in Puppeteer's launch arguments.
  • Adding the --no-sandbox, --disable-setuid-sandbox, --single-process, and --no-zygote flags in args of Puppeteer's launch arguments. (Reference: this comment & this comment)
  • Setting the waitUntil argument in Puppeteer's goto function to domcontentloaded, networkidle0 and networkidle2. (Reference: this comment)
  • Passing a timeout argument to Puppeteer's goto function; I've tried 30000 and 60000 specifically, as well as 0 per this comment.
  • Using the waitForSelector function.
  • Clearing Heroku's build cache, as per this article.
  • Printing the url variable (see my code below) in the console. Output is as expected.
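
A note on the timeout values tried above: the log line shows service=30000ms, meaning Heroku's router ends the request after 30 seconds regardless of the timeout passed to goto, so 60000 or 0 cannot take effect on the platform. Below is a minimal sketch of capping the whole scrape below that limit. withDeadline is a hypothetical helper name, and the 25000 ms figure in the usage comment is an illustrative assumption rather than a recommended setting.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Deadline of ${ms} ms exceeded`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage sketch inside a route handler (assumes `page` and `url` exist):
//   await withDeadline(page.goto(url, { waitUntil: "networkidle2" }), 25000);
```

With a cap like this, a slow page produces a rejection that the handler can turn into its own error response instead of leaving the router to cut the connection.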

I've observed that:

  • With the code I have right now (see below), the catch block of the try-catch-finally never runs. The outcome is always one of two things: I get an incomplete result (the static portion of the requested dynamic webpage), or the app crashes (code=H13 desc="Connection closed without response"). So printing the exception to the console from within the catch block has told me nothing.
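
One possible reason the catch block stays silent: in the code below, browser.newPage() is called before the try, so a rejection there never reaches catch. Assuming Express 4, which does not wait on the promise returned by an async handler, such a rejection goes unhandled and the request simply hangs until the router times out. Here is a minimal sketch of a wrapper that forwards these rejections to Express's error handling. asyncHandler is a hypothetical name, not part of Express.

```javascript
// Hypothetical wrapper: forwards any error thrown or rejected by an async
// route handler to Express's `next`, so error middleware can log and respond.
function asyncHandler(handler) {
  return (req, res, next) => {
    Promise.resolve()
      .then(() => handler(req, res, next))
      .catch(next);
  };
}

// Usage sketch (route name taken from the question):
//   app.get("/appropriate-route-name", asyncHandler(async (req, res) => { ... }));
//   app.use((err, req, res, next) => {
//     console.error(err);
//     res.status(500).send({ data: null });
//   });
```

Whether this surfaces the actual failure depends on where the error originates, but it at least guarantees that a rejected promise produces a log line and a response.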

Any ideas on how I could get this to work?

const express = require("express");
const app = express();
const puppeteer = require("puppeteer");
let port = process.env.PORT || 3000;
let browser;

...

app.listen(port, async() => {
  browser = await puppeteer
    .launch({
      timeout: 0,
      headless: true,
      args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--single-process",
        "--no-zygote",
      ],
    });
});

...

app.get("/appropriate-route-name", async (req, res) => {
  let url = req.query.url;
  let page = await browser.newPage();

  try {
    await page.goto(url, {
      waitUntil: "networkidle2",
    });
    res.send({ data: await page.content() });
  } catch (exception) {
    res.send({ data: null });
  } finally {
    await browser.close();
  }
});

Answer:

I was able to get it working by setting a user agent with the user-agents package. Dynamic pages now load just fine on Heroku, and requests no longer time out every single time.

const express = require("express");
const puppeteer = require("puppeteer");
const UserAgent = require("user-agents");

const app = express();
const port = process.env.PORT || 3000;

...

app.get("/route-name", async (req, res) => {
  const url = req.query.url;
  // Launch a fresh browser for each request.
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"],
  });

  try {
    const page = await browser.newPage();
    // Set a randomly generated, realistic user agent (added this).
    await page.setUserAgent(new UserAgent().toString());
    await page.goto(url, {
      timeout: 30000,
      waitUntil: "networkidle2", // or "networkidle0", depending on what you need
    });
    res.send({ data: await page.content() });
  } catch (e) {
    // Log the failure so it shows up in the Heroku logs.
    console.error(e);
    res.send({ data: null });
  } finally {
    await browser.close();
  }
});
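
Besides the user agent, the answer's code differs from the question's in a second way: it launches a browser per request, so browser.close() in finally no longer shuts down an instance shared across requests. In the question's code the first request closes the only browser, which may account for the single status=200 after each restart followed by failures. Launching Chromium on every request is memory-heavy, though, so here is a minimal sketch of the alternative: keep one browser and close only the page. withPage is a hypothetical helper name; it relies only on Puppeteer's browser.newPage() and page.close().

```javascript
// Hypothetical helper: run `task` with a fresh page from a shared browser,
// then close only that page so the browser stays usable for later requests.
async function withPage(browser, task) {
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    await page.close();
  }
}

// Usage sketch inside a route handler (assumes a `browser` launched once at startup):
//   const html = await withPage(browser, async (page) => {
//     await page.goto(url, { waitUntil: "networkidle2" });
//     return page.content();
//   });
```

The helper rethrows whatever the task throws, so the route handler still decides how to respond to a failed navigation.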