Load Cheerio/jQuery selectors, including methods as strings

Time:07-27

I have a Node.js app that uses cheerio to extract parts of the HTML from pages on multiple sites. The app reads a JSON file and, for each site, scrapes every URL and runs every cheerio query against it:

"site1":{
    "urls":{
         "http://site1.com/pageA",
         "http://site1.com/pageB",
    },
    "queries":{
          "h1": "$('h1').text()"
          "numbersFromH1": "$('h1').text().match(/\\d+/)[0]"
    } 
}

Loading only the selectors (e.g. 'h1') into a variable and calling .text() inside the app would be trivial. However, sometimes I need .match() or .filter(), and sometimes I need to chain several methods.

So, is there a way to load the whole query (selector plus methods) and have cheerio execute it?

CodePudding user response:

To answer my own question right away: As a hotfix, I use eval() to parse the imported query and consume its output.

However, since eval() is generally discouraged and I have doubts about its performance, I am still looking for a different solution, e.g. passing the whole query string (not just the selector) to cheerio somehow.
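
For illustration, a hotfix along these lines might look roughly like the sketch below. It assumes the query strings come from a trusted file. compileQuery is a hypothetical helper name, and the sketch uses the Function constructor rather than a bare eval() call so that $ is passed in explicitly. It carries the same security caveats as eval().

```javascript
// Compile a query string from the JSON file into a function of `$`.
// Like eval(), this runs arbitrary code, so only use it on trusted config.
function compileQuery(source) {
  // The parentheses make the string parse as a single expression.
  return new Function('$', `return (${source});`);
}

// Hypothetical usage, where `$` is the result of cheerio.load(html):
//   const getH1 = compileQuery("$('h1').text()");
//   console.log(getH1($));
```

Compiling each string once and reusing the resulting function for every page avoids re-parsing the query on each document.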

CodePudding user response:

The queries should be functions that take $ as a parameter, because $ is dynamic (generated anew for each document you parse). Inside the function, $ is in scope, so you can call whatever you need on it without eval. Note that functions cannot be stored in a JSON file, so the configuration has to live in a JavaScript module instead. For example:

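// ES module import; in an ES module the top-level await further down is allowed
import * as cheerio from 'cheerio';

const sites = {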
  "site1":{
    "urls":[ // make sure this is an array, not an object
         "http://site1.com/pageA",
         "http://site1.com/pageB",
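    ],
    "queries":{
          "h1": $ => $('h1').text(), // make sure to include comma after value
          // \d+ matches the first run of digits; ?.[0] yields undefined instead of throwing when there is none
          "numbersFromH1": $ => $('h1').text().match(/\d+/)?.[0]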
    } 
  }
};

const { site1 } = sites;
// getPageText is a placeholder: replace it with whatever you use to fetch the page HTML
const pageText = await getPageText(site1.urls[0]);
const firstUrl$ = cheerio.load(pageText);
console.log(site1.queries.h1(firstUrl$));

This should be easy to adapt to loops over all the URLs and queries.
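
One way the loop version might look is sketched below. The names runQueries, scrapeSite, fetchText and loadHtml are hypothetical. fetchText stands in for whatever retrieves the page HTML, and loadHtml is expected to be cheerio.load. Treating a throwing query as null is an assumption of this sketch, not something cheerio does.

```javascript
// Run every query function against one parsed document.
// `$` is whatever loadHtml(html) returned for that page.
function runQueries($, queries) {
  const results = {};
  for (const [name, query] of Object.entries(queries)) {
    try {
      results[name] = query($);
    } catch (err) {
      // A query can throw, e.g. when .match() finds nothing and returns null.
      results[name] = null;
    }
  }
  return results;
}

// Visit each URL of a site in turn and collect the query results per URL.
async function scrapeSite(site, fetchText, loadHtml) {
  const byUrl = {};
  for (const url of site.urls) {
    const $ = loadHtml(await fetchText(url));
    byUrl[url] = runQueries($, site.queries);
  }
  return byUrl;
}
```

With the sites object from above, a call might look like scrapeSite(sites.site1, getPageText, cheerio.load). Passing the fetcher and loader in as arguments keeps the loop independent of any particular HTTP client.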
