How to scrape image src the right way using puppeteer?

Time:08-01

This is the function I wrote to scrape each image's src, but it doesn't return the source. Strangely, if I read the alt attribute instead, it works fine. Does anyone know why this code doesn't work?

const fetchImgSrc = await page.evaluate(() => {
  const img = document.querySelectorAll(
    "#menus > div.tab-content >div > div > div.swiper-wrapper > div.swiper-slide > img"
  );
  let src = [];
  for (let i = 0; i < img.length; i++) {
    src.push(img[i].getAttribute("src"));
  }
  return src;
});

CodePudding user response:

To return the src links for the two menu images on this page, you can use:

const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll('.swiper-lazy-loaded');
    let src = [];
    for (let i = 0; i < img.length; i++) {
       src.push(img[i].getAttribute("src"));
    }
    return src;
});

This gives the expected output:

['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']

CodePudding user response:

You have two issues here:

  1. By default, Puppeteer opens the page in a small viewport (800x600), and the images you want to scrape are lazy loaded: while they are outside the viewport they are not loaded and do not even have a src yet. Set a larger viewport with page.setViewport.
  2. Element.getAttribute is not the best choice on dynamically changing pages: it returns the literal attribute string from the markup, which is empty (or missing) on a lazy-loaded image that has not loaded yet. Read the src property instead, which gives the element's current source as a resolved absolute URL. This is the attribute vs. property distinction in JavaScript.
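
To check the attribute vs. property difference on a given page, a small diagnostic along these lines may help. This is only a sketch: the name readSrcBothWays is made up for illustration, and it assumes page is a Puppeteer page that has already navigated to the site.

```javascript
// Hypothetical diagnostic: report both values for each image.
// It uses no outer variables, so it can be passed straight to page.$$eval.
const readSrcBothWays = (images) =>
  images.map((img) => ({
    attribute: img.getAttribute('src'), // literal markup value, or null if absent
    property: img.src,                  // URL string the element currently exposes
  }));

// Assumed usage, with `page` already navigated to the target site:
// const report = await page.$$eval('#menus img', readSrcBothWays);
// console.log(report);
```

If both columns come back empty for the images you care about, they have most likely not been loaded yet, which points at the viewport issue rather than at the choice of accessor.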

By the way: you can shorten your script with page.$$eval like this:

await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)

Output:

[
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]
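
A larger viewport only triggers the lazy loading, so the read can still race with the download. One possible way to guard against that is to wait until the matched images have actually loaded before collecting their src. The helper below is a sketch, not part of the original answer: the name scrapeImageSrcs, the default selector, and the 10 second timeout are assumptions, and it expects every matched image to load eventually (otherwise the wait times out).

```javascript
// Hypothetical helper; `page` is a Puppeteer Page that has already navigated.
async function scrapeImageSrcs(page, selector = '#menus img', timeout = 10000) {
  // A larger viewport brings the lazy-loaded images into view.
  await page.setViewport({ width: 1024, height: 768 });

  // Runs inside the browser: resolve once at least one image matches
  // and every matched image has real pixel data (naturalWidth > 0).
  await page.waitForFunction(
    (sel) => {
      const imgs = Array.from(document.querySelectorAll(sel));
      return imgs.length > 0 && imgs.every((img) => img.naturalWidth > 0);
    },
    { timeout },
    selector
  );

  // Read the src property (not the attribute) of each matched image.
  return page.$$eval(selector, (imgs) => imgs.map((img) => img.src));
}

// Assumed usage:
// const imgs = await scrapeImageSrcs(page);
// console.log(imgs);
```

If only some of the matched images are expected to load, a narrower selector such as the .swiper-lazy-loaded class from the other answer may be a better fit than waiting on all of them.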