Extract links and articles - Puppeteer

Time:07-21

I am trying to extract, from the front page of the NYT (https://www.nytimes.com), the link to each article and the complete content of each article.

To extract the links, I can adapt this example, which collects the titles of the first 10 stories on Hacker News:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.tracing.start({
    path: 'trace.json',
    categories: ['devtools.timeline']
  })
  await page.goto('https://news.ycombinator.com/news')

  // execute standard javascript in the context of the page.
  const stories = await page.$$eval('a.storylink', anchors => { return anchors.map(anchor => anchor.textContent).slice(0, 10) })
  console.log(stories)
  await page.tracing.stop()
  await browser.close()
})()

But I don't know how to extract the text of the article behind each link.

Could you please help me? Thank you!

PS: I searched through the examples and tutorials I could find online and didn't find anything that helped.

CodePudding user response:

Use anchors.map(anchor => anchor.href) to get the link URLs, and anchors.map(anchor => anchor.innerText) to get the link text.
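Combined into one call, that suggestion might look like the sketch below. The names toLinkRecords and extractLinks are made up for illustration, and the selector is left as a parameter because the right one depends on the markup of the page you are scraping. The mapping function is kept free of outside references so that Puppeteer can serialize it and run it inside the page.

```javascript
// Hypothetical helper: turns anchor elements (or anchor-like objects)
// into plain records that can be sent back from the page to Node.
// It must not reference variables outside itself, because page.$$eval
// serializes the function and runs it in the browser context.
function toLinkRecords(anchors) {
  return anchors
    .filter(anchor => anchor.href) // skip anchors without a target
    .map(anchor => ({
      href: anchor.href, // absolute URL as resolved by the browser
      text: (anchor.innerText || '').trim() // visible link text
    }))
}

// Hypothetical wrapper: `page` is a Puppeteer Page that has already
// navigated somewhere; `selector` is whatever matches the story links.
async function extractLinks(page, selector) {
  return page.$$eval(selector, toLinkRecords)
}
```

Inside the async function from the question, this could be called as const links = await extractLinks(page, 'a[href]'), where 'a[href]' is only a catch-all starting point and will also match navigation and section links.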

CodePudding user response:

You can try this (replace the placeholder selectors with ones that match the target page):

const stories = await page.evaluate(() => {
  const list = []
  // Placeholder selector: one element per news item on the page.
  const news_items = document.querySelectorAll(".relevant-class")

  for (const news_item of news_items) {
    list.push({
      // Query relative to the current item, not the whole document.
      heading: news_item.querySelector(".relevant_class h3").innerHTML,
      article: news_item.querySelector(".relevant_class").innerHTML
    })
  }

  return list
})
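The snippet above reads whatever is on the current page. To get the full text behind each link, one option is to collect the hrefs first and then visit them one at a time. The following is a minimal sketch under several assumptions: the helper names (uniqueSameOrigin, joinParagraphs, scrapeArticles) are invented here, the 'a[href]' and 'article p' selectors are guesses that will need adjusting to the real markup, and the front page will also yield section and navigation links that you may want to filter further.

```javascript
// Hypothetical helper: keep only URLs on the given origin, drop
// #fragments, and remove duplicates while preserving order.
function uniqueSameOrigin(hrefs, origin) {
  const seen = new Set()
  for (const href of hrefs) {
    try {
      const url = new URL(href)
      url.hash = ''
      if (url.origin === origin) seen.add(url.href)
    } catch {
      // Not a valid absolute URL: ignore it.
    }
  }
  return [...seen]
}

// Hypothetical helper: join paragraph strings into one article body.
function joinParagraphs(paragraphs) {
  return paragraphs
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .join('\n\n')
}

// Visits startUrl, collects links, then opens up to `limit` of them
// and reads the paragraph text from each page.
async function scrapeArticles(startUrl, limit = 5) {
  const puppeteer = require('puppeteer') // loaded only when scraping
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(startUrl, { waitUntil: 'domcontentloaded' })

    // Assumed selector: every anchor that has an href.
    const hrefs = await page.$$eval('a[href]', as => as.map(a => a.href))
    const origin = new URL(startUrl).origin
    const links = uniqueSameOrigin(hrefs, origin).slice(0, limit)

    const articles = []
    for (const link of links) {
      await page.goto(link, { waitUntil: 'domcontentloaded' })
      // Assumed selector: paragraphs inside an <article> element.
      const paragraphs = await page.$$eval('article p', ps =>
        ps.map(p => p.innerText)
      )
      articles.push({ link, text: joinParagraphs(paragraphs) })
    }
    return articles
  } finally {
    await browser.close()
  }
}
```

It could be run with scrapeArticles('https://www.nytimes.com').then(console.log). Reusing a single page and visiting links sequentially keeps the sketch simple; pages that render their body late or require a login may return little or no text with these selectors.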