I'm trying to web scrape content from 2 different divs that are on the same level. I'm using NodeJS, Axios, Cheerio and Express.
Basically, I'm trying to collect an image and the info related to it, but they are placed of different divs that are on the same level. Using the "main" doesn't seem to work in my case.
<div >
<div >
// image
</div>
<div >
// info
</div>
</div>
Below is my code to get the data from a website:
var leafletList = $('.store-flyer__info', html).each(function() {
let leaflet = {
title: $(this).find('h3').text(),
image: $(this).find('source').attr('srcset'),
link: $(this).find('a').attr('href'),
validDate: $(this).find('small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
Below is the website's HTML:
The way my code is right now, it's obviously getting only the title, link and validDate. But anyone knows how can I get the the srcset from the other div? I've also tried the following method, but it doesn't work:
var leafletList = $('.store-flyers', html).each(function() {
let leaflet = {
title: $(this).find('.store-flyer__info h3').text(),
image: $(this).find('.store-flyer__front source').attr('srcset'),
link: $(this).find('.store-flyer__info a').attr('href'),
validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
CodePudding user response:
With cheerio, you can access node properties such as:
parentNode
previousSibling
nextSibling
nodeValue
firstChild
childNodes
lastChild
<div >
<div >
// image
</div>
<div >
// info
</div>
</div>
.main.firstChild is .one
.one.nextSibling is .two
.main.lastChild is .two
.two.previousSibling is .one
CodePudding user response:
There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading. It shows elements created after page load with JS, which you won't have if you're only requesting the raw page HTML.
With that in mind, here are a few options:
const cheerio = require("cheerio"); // ^1.0.0-rc.12
const html = `
<div >
<picture>
<source srcset="foo.jpeg" type="image/webp">
<source srcset="bar.jpeg" type="image/jpeg">
</picture>
</div>
<div >
<picture>
<source srcset="quux.jpeg" type="image/webp">
<source srcset="garply.jpeg" type="image/jpeg">
</picture>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".store-flyer")].map(e => ({
// select using `.first()` and `.last()` Cheerio methods:
firstImage: $(e).find("source").first().attr("srcset"),
secondImage: $(e).find("source").last().attr("srcset"),
// select using CSS attribute selectors:
firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),
// select as an array of all <source> elements:
allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
}));
console.log(result);
Output:
[
{
firstImage: 'foo.jpeg',
secondImage: 'bar.jpeg',
firstImageByType: 'foo.jpeg',
secondImageByType: 'bar.jpeg',
allImages: [ 'foo.jpeg', 'bar.jpeg' ]
},
{
firstImage: 'quux.jpeg',
secondImage: 'garply.jpeg',
firstImageByType: 'quux.jpeg',
secondImageByType: 'garply.jpeg',
allImages: [ 'quux.jpeg', 'garply.jpeg' ]
}
]
Prepending .store-flyer__front
to your source
selectors might be a good idea if you need to disambiguate.