This is the function I wrote to scrape the src of some images, but it doesn't return the source. Strangely, if I read the alt attribute instead, it works fine. Does anyone know why this code doesn't work?
const fetchImgSrc = await page.evaluate(() => {
  const img = document.querySelectorAll(
    "#menus > div.tab-content > div > div > div.swiper-wrapper > div.swiper-slide > img"
  );
  let src = [];
  for (let i = 0; i < img.length; i++) {
    src.push(img[i].getAttribute("src"));
  }
  return src;
});
CodePudding user response:
To return the src links of the two menu images on this page, you can use:
const fetchImgSrc = await page.evaluate(() => {
  const img = document.querySelectorAll('.swiper-lazy-loaded');
  let src = [];
  for (let i = 0; i < img.length; i++) {
    src.push(img[i].getAttribute("src"));
  }
  return src;
});
This gives the expected output:
['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']
CodePudding user response:
You have two issues here:
- By default, Puppeteer opens the page in a small viewport (800x600), and the images to be scraped are lazy loaded: while they are outside the viewport they are not loaded (they don't even have a src yet). You need to give the page a bigger viewport with page.setViewport.
- Element.getAttribute is not advised when you work with dynamically changing websites: it returns the raw attribute value, which is still empty for a lazy-loaded image that has not loaded. What you need is the src property, which is always up to date in the DOM. This is the attribute vs. property distinction in JavaScript.
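The attribute vs. property difference can be sketched without a browser. In the snippet below, fakeImg is a hypothetical stand-in for an img element (not a real DOM node), example.com is a placeholder URL, and collectSrcs is an illustrative helper name that is not part of Puppeteer. It assumes, as in the DOM, that getAttribute returns the raw attribute text (null when the attribute is absent) while the src property returns the resolved URL (an empty string when the attribute is absent):

```javascript
// Hypothetical stand-in for an <img> element, for illustration only.
// getAttribute("src") gives the raw attribute text; .src gives the resolved URL.
function fakeImg(rawSrc, baseUrl) {
  return {
    getAttribute: (name) => (name === "src" ? rawSrc : null),
    get src() {
      // No src attribute yet (lazy image not loaded): the property is ""
      return rawSrc === null ? "" : new URL(rawSrc, baseUrl).href;
    },
  };
}

// Callback in the shape page.$$eval expects: read the src property
// and drop images that have no source yet.
function collectSrcs(images) {
  return images.map((img) => img.src).filter((src) => src !== "");
}

const images = [
  fakeImg("/menu/a.jpg", "https://example.com/"), // already loaded
  fakeImg(null, "https://example.com/"),          // lazy image, no src attribute yet
];

console.log(images.map((img) => img.getAttribute("src"))); // raw values: '/menu/a.jpg' and null
console.log(collectSrcs(images)); // only the resolved URL of the loaded image
```

In a real script, a function like collectSrcs could be passed as the second argument of page.$$eval, since that callback receives the matched elements as an array.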
By the way, you can shorten your script with page.$$eval like this:
await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)
Output:
[
'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]
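If the images are still empty right after the viewport change, one option is to wait until at least one of them has a source before collecting. The following is only a sketch under assumptions: scrapeImageSrcs is a hypothetical helper name, the default selector and viewport size are taken from the snippet above, and the wait condition (any matching image with a non-empty src) is an illustrative choice rather than something this particular page is known to need.

```javascript
// Sketch: collect image URLs from a page whose images are lazy loaded.
// `page` is expected to be a Puppeteer Page that has already navigated to the target site.
async function scrapeImageSrcs(page, selector = "#menus img") {
  // A larger viewport brings more of the lazy-loaded images into view
  await page.setViewport({ width: 1024, height: 768 });

  // Assumption: it is enough to wait until at least one matching image has a src
  await page.waitForFunction(
    (sel) => Array.from(document.querySelectorAll(sel)).some((img) => img.src),
    {},
    selector
  );

  // Read the src property (not the attribute) and skip images that are still empty
  return page.$$eval(selector, (images) =>
    images.map((img) => img.src).filter(Boolean)
  );
}
```

It would be called as const imgs = await scrapeImageSrcs(page) once the page has been opened. Taking the page as a parameter keeps the helper independent of how the browser is launched.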