puppeteer / node.js - enter page, click load more until all comments load, save page as mhtml

Time:12-19

What I'm trying to accomplish is to open https://www.discoverpermaculture.com/permaculture-masterclass-video-1, wait until it loads, load all the comments from Disqus (by clicking the 'Load more comments' button until it is no longer present), and save the page as MHTML for offline use.

I found a similar question here, "Puppeteer / Node.js to click a button as long as it exists -- and when it no longer exists, commence action", but unfortunately detecting the "Load more comments" button doesn't work for some reason.

It seems that waitForSelector('a.load-more__button') is not working, because all the script prints is "not visible".

Here's my code:

const puppeteer = require('puppeteer');
const url = "https://www.discoverpermaculture.com/permaculture-masterclass-video-1";

const isElementVisible = async (page, cssSelector) => {
    let visible = true;
    await page
        .waitForSelector(cssSelector, { visible: true, timeout: 4000 })
        .catch(() => {
            console.log('not visible');
            visible = false;
        });
    return visible;
};

async function run () {

    let browser = await puppeteer.launch({
        headless: true,
        defaultViewport: null,
        args: [
            '--window-size=1920,10000',
        ],
    });
    const page = await browser.newPage();
    const fs = require('fs');
    await page.goto(url);
    await page.waitForNavigation();
    await page.waitForTimeout(4000)

    const selectorForLoadMoreButton = 'a.load-more__button';
    let loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
    while (loadMoreVisible) {
        console.log('load more visible');
        await page
            .click(selectorForLoadMoreButton)
            .catch(() => {});
        await page.waitForTimeout(4000);

        loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
    }

    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
    await browser.close();
}
run();

CodePudding user response:

You're just waiting for an AJAX request to be processed. You could simply save the total number of comments (shown at the top left of the DISQUS plugin) and compare it to the number of comments loaded so far. Once the two are equal, you've retrieved every comment.

I posted an answer a while back on waiting for AJAX requests; you can see it here: https://stackoverflow.com/a/66092889/3645650.
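The counting idea above could be sketched roughly like this. It is only an illustration, not the code from the linked answer: getTotal, getLoaded and clickMore are hypothetical callbacks you would supply yourself, and the selectors in the usage comment are guesses, not verified against the Disqus markup.

```javascript
// Sketch: keep clicking "load more" until the loaded count reaches the total.
// getTotal, getLoaded and clickMore are hypothetical callbacks supplied by the caller.
async function loadUntilComplete({
    getTotal,
    getLoaded,
    clickMore,
    delayMs = 1000,   // pause after each click so the AJAX request can finish
    maxRounds = 200,  // hard upper bound on the number of clicks
    maxStalls = 3,    // give up after this many rounds without new comments
}) {
    const total = await getTotal();
    let loaded = await getLoaded();
    let stalls = 0;
    for (let round = 0; loaded < total && round < maxRounds; round++) {
        await clickMore();
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        const now = await getLoaded();
        if (now > loaded) {
            loaded = now;
            stalls = 0;
        } else if (++stalls >= maxStalls) {
            break; // no progress; stop rather than loop forever
        }
    }
    return { total, loaded, complete: loaded >= total };
}

// Possible use with Puppeteer (selectors are assumptions, not checked):
// const result = await loadUntilComplete({
//     getTotal: () => frame.$eval('.comment-count', (el) => parseInt(el.textContent, 10)),
//     getLoaded: () => frame.$$eval('.post', (els) => els.length),
//     clickMore: () => frame.click('a.load-more__button'),
// });
```

The stall guard is there because the displayed total may not always match what can actually be loaded, in which case a plain "until equal" loop would never end.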


Alternatively, a simpler approach would be to just use the DISQUS API.

The comments are publicly accessible. You can just use the API key from the website:

https://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=7187962034&forum=pdc2018&order=popular&cursor=1:0:0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F

Parameter   Options
limit       Defaults to 50. Maximum is 100.
thread      Thread number, e.g. 7187962034.
forum       Forum id, e.g. pdc2018.
order       desc, asc, popular.
cursor      Probably the page number. Format is 1:0:0, e.g. page 2 would be 2:0:0.
api_key     The platform API key. Here the API key is E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F.

If you have to iterate over several different pages, you would need to intercept the XHR responses on each one to retrieve its thread number.
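As a rough sketch of how the parameters above fit together, the URL can be built and paged through like this. It assumes Node 18+ for the global fetch, and it assumes each response carries a `response` array plus a `cursor` object with `hasNext` and `next` fields; check a real response before relying on that. buildListPostsUrl and fetchAllPosts are names made up for this example.

```javascript
const API_BASE = 'https://disqus.com/api/3.0/threads/listPostsThreaded';

// Build the request URL from the parameters listed above.
function buildListPostsUrl({ thread, forum, apiKey, cursor = '1:0:0', limit = 100, order = 'asc' }) {
    const url = new URL(API_BASE);
    url.search = new URLSearchParams({
        limit: String(limit),
        thread: String(thread),
        forum,
        order,
        cursor,
        api_key: apiKey,
    }).toString();
    return url.toString();
}

// Follow the cursor until there are no more pages.
// fetchJson is injectable so the loop can be exercised without network access.
async function fetchAllPosts(params, fetchJson = async (u) => (await fetch(u)).json(), maxPages = 100) {
    const posts = [];
    let cursor = params.cursor ?? '1:0:0';
    for (let page = 0; page < maxPages; page++) {
        const body = await fetchJson(buildListPostsUrl({ ...params, cursor }));
        posts.push(...(body.response ?? []));
        // Assumed response shape: { cursor: { hasNext, next }, response: [...] }
        if (!body.cursor || !body.cursor.hasNext) break;
        cursor = body.cursor.next;
    }
    return posts;
}

// Possible use, with the thread and forum values from above:
// const posts = await fetchAllPosts({ thread: '7187962034', forum: 'pdc2018', apiKey: '<api key>' });
```

Using the cursor value returned by the previous response, rather than guessing 2:0:0, 3:0:0 and so on, avoids depending on the cursor really being a page number.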

CodePudding user response:

It turned out the problem was that the Disqus comments were inside an iframe, so selectors run against the page itself could never match the button.

// I needed to add these 2 lines (note: this grabs the first iframe on the page)
const elementHandle = await page.waitForSelector('iframe');
const frame = await elementHandle.contentFrame();

// and change 'page' to 'frame' below
let loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
while (loadMoreVisible) {
    console.log('load more visible');
    await frame
        .click(selectorForLoadMoreButton)
        .catch(() => {});
    await frame.waitForTimeout(4000);
    loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
}

After this change it works perfectly.
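One caveat: waitForSelector('iframe') returns whichever iframe comes first in the document, which happens to work here but may not on other pages. A minimal sketch of picking the frame by its URL instead is below. It assumes the Disqus frame's URL contains "disqus.com", which I have not verified for this site, and findFrameByUrl / waitForFrameByUrl are helper names invented for the example.

```javascript
// Return the first frame whose URL contains `needle`, or null if none matches.
function findFrameByUrl(frames, needle) {
    return frames.find((frame) => frame.url().includes(needle)) ?? null;
}

// Poll until a matching frame appears, since embedded frames are often attached after page load.
// `getFrames` is a callback so the frame list is re-read on every attempt.
async function waitForFrameByUrl(getFrames, needle, { timeoutMs = 10000, intervalMs = 250 } = {}) {
    const deadline = Date.now() + timeoutMs;
    for (;;) {
        const frame = findFrameByUrl(getFrames(), needle);
        if (frame) return frame;
        if (Date.now() >= deadline) {
            throw new Error(`No frame with URL containing "${needle}"`);
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
}

// Possible use with Puppeteer (the URL fragment is an assumption):
// const frame = await waitForFrameByUrl(() => page.frames(), 'disqus.com');
```

The returned frame can then be used in place of 'frame' in the loop above.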
