Get the main content of a page while web scraping with Node.js, Puppeteer, and Cheerio

Time:02-14

I have a Node.js web scraping project in which I need to scrape the headings and text from a page's main content. The problem is that I can't determine which part is the main content when there is no aside or main tag, and no class, id, or role named aside or main. I'm using the Puppeteer and Cheerio libraries. I have tried Mercury Web Parser, but it has its own problems; for example, it doesn't return any content from pages built with the Elementor theme builder on WordPress. If anyone has an idea of how I can differentiate the main content from the rest of the web page, it would be really helpful.

CodePudding user response:

You can check out the Readability.js library from Mozilla. They use it for Firefox's Reader View.
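
As a rough sketch of how Readability.js might be wired in, assuming the @mozilla/readability and jsdom packages are installed (npm install @mozilla/readability jsdom). The helper name extractMainContent is made up for illustration and is not part of either library:

```javascript
// Assumption: `npm install @mozilla/readability jsdom` has been run.
// extractMainContent is a hypothetical helper name, not a library API.
function extractMainContent(html, url) {
  const { JSDOM } = require('jsdom');
  const { Readability } = require('@mozilla/readability');

  // Readability works on a DOM document; passing the page URL lets
  // jsdom resolve relative links inside the extracted content.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();

  // parse() returns null when no article-like content is found.
  if (!article) return null;

  return {
    title: article.title,
    html: article.content,            // cleaned-up HTML of the main content
    text: article.textContent.trim(), // plain text of the main content
  };
}
```

With Puppeteer, the inputs could come from the loaded page, for example extractMainContent(await page.content(), page.url()). The returned html string can then be passed to Cheerio to pull out headings and paragraphs. Whether Readability copes with a given Elementor page is something to verify per site.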

CodePudding user response:

Try to learn more about CSS selectors and specificity.
If you're scraping Elementor pages, here is a trick for the selector: target the data-elementor-(attributename) attributes that Elementor adds to elements in the DOM.

// timeout: 0 disables the timeout, so this waits until the element is visible
const mainContent = await page.waitForSelector('[data-elementor-type="wp-page"]', {visible: true, timeout: 0})
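
Once the Elementor wrapper is present, the headings and text inside it could be collected with Cheerio. This is a minimal sketch, assuming cheerio is installed (npm install cheerio); extractHeadingsAndText is a hypothetical helper, and the fallback to body is an illustrative choice rather than anything Elementor requires:

```javascript
// Assumption: `npm install cheerio` has been run.
// extractHeadingsAndText is a hypothetical helper, not a Cheerio API.
function extractHeadingsAndText(html, rootSelector = '[data-elementor-type="wp-page"]') {
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);

  // Use the Elementor page wrapper if it exists, otherwise fall back to <body>.
  const root = $(rootSelector).first();
  const scope = root.length ? root : $('body');

  // Collect headings and paragraphs in document order, skipping empty ones.
  const items = [];
  scope.find('h1, h2, h3, h4, h5, h6, p').each((_, el) => {
    const text = $(el).text().trim();
    if (text) items.push({ tag: el.name, text });
  });
  return items;
}
```

In a Puppeteer script the HTML would typically come from await page.content() after the selector has appeared. Header and footer text outside the wrapper is left out because the search is scoped to the matched element.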