Scrape function only returns one of the links instead of all of the products I'm trying to get

Time:06-02

I'm really new to Node.js and Puppeteer.
I'm trying to get the links of all the products returned by a search and save them, but only the first one is saved and not the others. I don't know if my selectors are wrong or if the code is wrong.

const scraperObject = {
    url: 'https://diaonline.supermercadosdia.com.ar/busca/?ft=pepsi',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        // Wait for the required DOM to be rendered
        await page.waitForSelector('.wrapper > .main');
        // Get the link to all the required products
        let urls = await page.$$eval('section > div.coleccion-prods > div > div.vitrine.resultItemsWrapper', links => { 
             
            links = links.filter(link => link.querySelector('.marca').textContent !== "PEPSI")
            //Extract the links from the data
            links = links.map(el => el.querySelector('h3 > a').href)
            return links;

        });
        console.log(urls);
        
    }
}

module.exports = scraperObject;

Node.js version 18.1.0

Puppeteer version 14.1.2

Answer:

Try something like this:

import ppt from 'puppeteer';

const url = 'https://diaonline.supermercadosdia.com.ar/busca/?ft=pepsi';
const selectors = {
  main: '.wrapper > .main',
  productLink: "//div[contains(@class, 'product-name')]/h3/a/@href",
};

(async () => {
  const browser = await ppt.launch({
    headless: true,
    devtools: false,
    args: [
      '--window-size=1600,1200',
      '--disable-web-security',
      '--disable-site-isolation-trials',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-infobars',
      '--window-position=0,0',
      '--ignore-certificate-errors',
      '--ignore-certificate-errors-spki-list',
    ],
  });

  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582');
  await page.goto(url);
  await page.waitForSelector(selectors.main);
  await page.waitForXPath(selectors.productLink, { timeout: 3000 });
  const handles = await page.$x(selectors.productLink);
  const links = await Promise.all(handles.map((h) => h.evaluate((elm) => elm.textContent)));
  await browser.close();
  console.log(links);
})();

Output:

[
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-3-lts-120091/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-500-ml-56079/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-black-15-lts-247793/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-black-225-lts-239383/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-pepsi-black-lata-354-cc-275841/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-light-225-lts-108411/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-pepsi-cola-225-lts-199640/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-en-lata-354-ml-68235/p',
  'https://diaonline.supermercadosdia.com.ar/gaseosa-cola-pepsi-15-lts-39692/p'
]
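
As for why the original code logs a single link: a likely cause (inferred from the selector, not verified against the live page) is that 'section > div.coleccion-prods > div > div.vitrine.resultItemsWrapper' matches the one wrapper around all products. In that case $$eval passes a one-element array to the callback, and querySelector('h3 > a') returns only the first anchor inside it. The filter also keeps elements whose brand is not "PEPSI", which looks inverted. Here is a minimal sketch that keeps the wrapper selector but collects every anchor with querySelectorAll. The name linksFromWrappers is hypothetical, and the sketch assumes the wrapper really contains all the product links.

```javascript
// Runs inside the page when passed to page.$$eval. It receives the matched
// wrapper element(s) and returns the href of EVERY 'h3 > a' inside them.
// querySelector would stop at the first match; querySelectorAll does not.
function linksFromWrappers(wrappers) {
  return wrappers.flatMap((wrapper) =>
    Array.from(wrapper.querySelectorAll('h3 > a'), (a) => a.href)
  );
}

// Selector copied from the question (assumed to match the product wrapper).
const WRAPPER_SELECTOR =
  'section > div.coleccion-prods > div > div.vitrine.resultItemsWrapper';

// Usage with Puppeteer (not executed here):
//   const urls = await page.$$eval(WRAPPER_SELECTOR, linksFromWrappers);
```

The function refers to nothing outside its own body, which matters because Puppeteer serializes it and runs it in the browser context.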

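If I may suggest, take a look at XPath selectors. They are more versatile than standard CSS selectors: for example, they can select an attribute such as @href directly, as above, or match elements by their text.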
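
For comparison, the same anchors can probably be reached with a plain CSS selector as well. The sketch below is an assumed translation of the XPath above, with [class*="product-name"] standing in for contains(@class, 'product-name'). It has not been checked against the live page, and collectHrefs is a hypothetical helper name.

```javascript
// Assumed CSS counterpart of //div[contains(@class, 'product-name')]/h3/a
const LINK_SELECTOR = 'div[class*="product-name"] > h3 > a';

// Runs inside the page when passed to page.$$eval: maps each matched anchor
// to its href and drops anchors whose href is empty.
function collectHrefs(anchors) {
  return anchors.map((a) => a.href).filter((href) => href !== '');
}

// Usage with Puppeteer (not executed here):
//   const links = await page.$$eval(LINK_SELECTOR, collectHrefs);
```

Because the selector matches one element per link, no extra querySelector call is needed inside the callback.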
