I am trying to extract two things from the front page of the NYT (https://www.nytimes.com):
the link to each article and the complete content of each article.
To extract the links, I can use this example:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.tracing.start({
    path: 'trace.json',
    categories: ['devtools.timeline']
  })
  await page.goto('https://news.ycombinator.com/news')
  // execute standard javascript in the context of the page.
  const stories = await page.$$eval('a.storylink', anchors => {
    return anchors.map(anchor => anchor.textContent).slice(0, 10)
  })
  console.log(stories)
  await page.tracing.stop()
  await browser.close()
})()
But I don't know how to extract the content of the articles (text) for each link.
Could you please help me? Thank you!
PS: I searched through the examples and tutorials I could find online, but I didn't find anything that helped.
CodePudding user response:
Use anchors.map(anchor => anchor.href) for the hrefs, and anchors.map(anchor => anchor.innerText) for the text.
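Putting the two together, here is a minimal sketch of one way this could work: collect the hrefs from the front page, then open each link and read its text. The helper names uniqueArticleLinks and scrapeArticles, the 'a[href]' and 'article p' selectors, and the default limit of 5 are assumptions for illustration, not something taken from the NYT markup. The real page may need different selectors, and it may require a login or block headless browsers.

```javascript
// Keep only same-site http(s) links, drop #fragments, and remove duplicates.
function uniqueArticleLinks(hrefs, baseUrl) {
  const base = new URL(baseUrl);
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative hrefs against the base
    } catch {
      continue; // skip malformed hrefs
    }
    if (!/^https?:$/.test(url.protocol)) continue; // skip mailto:, javascript:, ...
    if (url.hostname !== base.hostname) continue;  // skip external sites
    url.hash = '';
    seen.add(url.href);
  }
  return [...seen];
}

// Hypothetical helper: visit the start page, then each collected link in turn.
async function scrapeArticles(startUrl, limit = 5) {
  // Loaded here so uniqueArticleLinks can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(startUrl, { waitUntil: 'domcontentloaded' });

    // Assumed selector: every anchor that has an href attribute.
    const hrefs = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));
    const links = uniqueArticleLinks(hrefs, startUrl).slice(0, limit);

    const articles = [];
    for (const link of links) {
      await page.goto(link, { waitUntil: 'domcontentloaded' });
      // Assumed selector: paragraphs inside an <article> element.
      const text = await page.$$eval('article p', ps =>
        ps.map(p => p.innerText).join('\n'));
      articles.push({ link, text });
    }
    return articles;
  } finally {
    await browser.close();
  }
}
```

Calling scrapeArticles('https://www.nytimes.com').then(console.log) should give an array of { link, text } objects. Visiting the links one after another on a single page keeps the sketch simple; inspect the site in DevTools first and narrow the selectors so that only article links and article paragraphs match.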
CodePudding user response:
You can try this:
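const stories = await page.evaluate(() => {
  const list = []
  // The class names below are placeholders: replace them with the real selectors from the page.
  const news_items = document.querySelectorAll(".relevant-class")
  for (const news_item of news_items) {
    list.push({
      // Use the loop variable news_item here (the original referenced an undefined "item").
      heading: news_item.querySelector(".relevant_class h3").innerHTML,
      article: news_item.querySelector(".relevant_class").innerHTML,
    })
  }
  return list
})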