How to scrape from 2 divs that are on the same level with Cheerio

Time:02-05

I'm trying to web scrape content from 2 different divs that are on the same level. I'm using NodeJS, Axios, Cheerio and Express.

Basically, I'm trying to collect an image and the info related to it, but they are placed in different divs that are on the same level. Using the "main" div as the selector doesn't seem to work in my case.

<div class="main">
    <div class="one">
        // image
    </div>
    <div class="two">
        // info
    </div>
</div>

Below is my code to get the data from a website:

var leafletList = $('.store-flyer__info', html).each(function() {
    let leaflet = {
        title: $(this).find('h3').text(),
        image: $(this).find('source').attr('srcset'),
        link: $(this).find('a').attr('href'),
        validDate: $(this).find('small').text().slice(3,-1)
    }

    leaflets.push(leaflet)
})

Below is the website's HTML:

[Image: website's HTML]

The way my code is right now, it's obviously getting only the title, link and validDate. Does anyone know how I can get the srcset from the other div? I've also tried the following method, but it doesn't work:

var leafletList = $('.store-flyers', html).each(function() {
    let leaflet = {
        title: $(this).find('.store-flyer__info h3').text(),
        image: $(this).find('.store-flyer__front source').attr('srcset'),
        link: $(this).find('.store-flyer__info a').attr('href'),
        validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
    }

    leaflets.push(leaflet)
})

CodePudding user response:

With Cheerio, the underlying DOM nodes (for example, $(selector)[0]) expose properties such as:

parentNode
previousSibling
nextSibling
nodeValue
firstChild
childNodes
lastChild

<div class="main">
    <div class="one">
        // image
    </div>
    <div class="two">
        // info
    </div>
</div>

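Ignoring the whitespace text nodes between the tags:

.main.firstChild is .one

.one.nextSibling is .two

.main.lastChild is .two

.two.previousSibling is .one

These properties do not skip text nodes, so with indented HTML like the snippet above, firstChild and nextSibling can return a whitespace text node rather than a div. Cheerio's own .children(), .next() and .prev() methods return only elements.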

CodePudding user response:

There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading. They show elements created after page load with JS, which you won't have if you're only requesting the raw page HTML.

With that in mind, here are a few options:

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const html = `
<div class="store-flyer">
  <picture>
    <source srcset="foo.jpeg" type="image/webp">
    <source srcset="bar.jpeg" type="image/jpeg">
  </picture>
</div>
<div class="store-flyer">
  <picture>
    <source srcset="quux.jpeg" type="image/webp">
    <source srcset="garply.jpeg" type="image/jpeg">
  </picture>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".store-flyer")].map(e => ({
  // select using `.first()` and `.last()` Cheerio methods:
  firstImage: $(e).find("source").first().attr("srcset"),
  secondImage: $(e).find("source").last().attr("srcset"),

  // select using CSS attribute selectors:
  firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
  secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),

  // select as an array of all <source> elements:
  allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
}));
console.log(result);

Output:

[
  {
    firstImage: 'foo.jpeg',
    secondImage: 'bar.jpeg',
    firstImageByType: 'foo.jpeg',
    secondImageByType: 'bar.jpeg',
    allImages: [ 'foo.jpeg', 'bar.jpeg' ]
  },
  {
    firstImage: 'quux.jpeg',
    secondImage: 'garply.jpeg',
    firstImageByType: 'quux.jpeg',
    secondImageByType: 'garply.jpeg',
    allImages: [ 'quux.jpeg', 'garply.jpeg' ]
  }
]

Prepending .store-flyer__front to your source selectors might be a good idea if you need to disambiguate.
