Trying to scrape a link from a webpage with Puppeteer

Time:10-11

I do not understand why I cannot retrieve the href of a link on this page with Puppeteer: PubChem.

I opened the page in Chrome, inspected it, found the desired headline of the chemical, and copied its selector, which looks like this: #featured-results > div:nth-child(2) > div.box-shadow > div > div.p-md-rectangle.flex-container.flex-nowrap.width-100 > div.flex-grow-1.p-md-left > div.f-medium.p-sm-top.p-sm-bottom.f-1125 > a

Then I ran this JS code with Node.js:

     const puppeteer = require('puppeteer')
     puppeteer.launch({ headless: true }).then(async browser => {
         const page = await browser.newPage()
         await page.goto('https://pubchem.ncbi.nlm.nih.gov/#query=MES')
     //    const cookies = await page.cookies()
     //    console.log(cookies)
         const links = await page.evaluate(() => [document.querySelectorAll('#featured-results > div:nth-child(2) > div.box-shadow > div > div.p-md-rectangle.flex-container.flex-nowrap.width-100 > div.flex-grow-1.p-md-left > div.f-medium.p-sm-top.p-sm-bottom.f-1125 > a')].map(link => link.href))
         links.forEach(link => console.log(link))
        
         await browser.close()
     })

But my result is null. Could anyone open my eyes, please? Thanks.

Answer:

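  1. You need to wait for the element to appear. The URL carries the query in its hash, so the search results are filled in by the page's scripts after page.goto() resolves, and the element may not exist yet when your code queries it.
  2. You need spread (...) to create an array from the querySelectorAll() result. [document.querySelectorAll(...)] is a one-element array containing the NodeList itself, so map() reads .href from the NodeList, which is undefined.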
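
The second point can be checked without a browser. The following is a minimal sketch, assuming a Set as a stand-in for the NodeList (both are iterable but are not arrays); the name fakeNodeList and the example.com URLs are made up for illustration.

```javascript
// Stand-in for the NodeList returned by querySelectorAll(): iterable, but not an Array.
// (Hypothetical data; a real NodeList holds DOM elements.)
const fakeNodeList = new Set([
  { href: 'https://example.com/a' },
  { href: 'https://example.com/b' },
])

// Brackets alone give a one-element array whose only item is the collection itself,
// so map() reads .href from the collection and gets undefined.
const wrapped = [fakeNodeList].map(link => link.href)

// Spreading iterates the collection, so map() sees each link object.
const spread = [...fakeNodeList].map(link => link.href)

console.log(wrapped) // [ undefined ]
console.log(spread)  // [ 'https://example.com/a', 'https://example.com/b' ]
```

An undefined entry like this would most likely come back from page.evaluate() as null once the result is serialized, which matches the output described in the question.
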

     const puppeteer = require('puppeteer')

     // Same selector as in the question, stored once so both calls stay in sync.
     const selector = '#featured-results > div:nth-child(2) > div.box-shadow > div > div.p-md-rectangle.flex-container.flex-nowrap.width-100 > div.flex-grow-1.p-md-left > div.f-medium.p-sm-top.p-sm-bottom.f-1125 > a'

     puppeteer.launch({ headless: true }).then(async browser => {
         const page = await browser.newPage()
         await page.goto('https://pubchem.ncbi.nlm.nih.gov/#query=MES')

         // 1. Wait until the results have been rendered into the DOM.
         await page.waitForSelector(selector)

         // 2. Spread the NodeList into a real array before calling map().
         // The selector is passed as an argument because the callback runs in the page.
         const links = await page.evaluate(
             sel => [...document.querySelectorAll(sel)].map(link => link.href),
             selector
         )
         links.forEach(link => console.log(link))

         await browser.close()
     })
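
As a variation on the two steps above, the wait and the extraction could be wrapped in one small function. This is only a sketch: collectHrefs and its timeoutMs default are hypothetical names and values, not part of Puppeteer. It relies on page.waitForSelector() and page.$$eval(), the latter of which hands the matched elements to the callback as a real array, so no spread is needed.

```javascript
// Hypothetical helper: wait for a selector, then return the href of every match.
// `page` is expected to be a Puppeteer Page; timeoutMs is an illustrative default.
async function collectHrefs(page, selector, timeoutMs = 15000) {
  // Rejects with a timeout error if nothing matches within timeoutMs.
  await page.waitForSelector(selector, { timeout: timeoutMs })
  // $$eval runs querySelectorAll in the page and passes the matches as an Array.
  return page.$$eval(selector, anchors => anchors.map(a => a.href))
}
```

Inside the async callback of the answer, `const links = await collectHrefs(page, selector)` would take the place of the separate waitForSelector() and evaluate() calls, with selector holding the long selector string copied from Chrome.
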