I have a Node.js app that uses cheerio to extract parts of the HTML from pages on multiple sites. The app reads a JSON file and, for each site, scrapes every URL and runs every cheerio query against it:
"site1":{
  "urls":{
    "http://site1.com/pageA",
    "http://site1.com/pageB",
  },
  "queries":{
    "h1": "$('h1').text()"
    "numbersFromH1": "$('h1').text().match(/\\d+/)[0]"
  }
}
Loading the selectors, e.g. 'h1', using a variable and having a .text() method inside the app would be a no-brainer. However, sometimes I need to .match() or .filter() etc., sometimes chain the methods.
So, is there a way I could load the whole query (selector + methods) and have cheerio execute it?
CodePudding user response:
To answer my own question right away: As a hotfix, I use eval() to parse the imported query and consume its output.
However, since eval is generally not recommended and I have doubts about its performance, I am still looking for a different solution, e.g. by somehow passing the whole query string (not just the selector) to cheerio.
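A minimal sketch of such a hotfix, assuming the query string is evaluated inside a function that has $ in scope. The name runQueryString is hypothetical, and this is only reasonable when the JSON file comes from a trusted source, since eval() runs whatever code the string contains:

```javascript
// Hypothetical helper: evaluates a query string such as "$('h1').text()".
// A direct eval() call can see local variables, so the `$` parameter
// is visible to the code inside the string.
function runQueryString(queryString, $) {
  return eval(queryString);
}
```

Here $ would be the function returned by cheerio.load() for the page being processed.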
CodePudding user response:
The queries should be functions that take $ as a parameter, because $ is dynamic (generated anew for each new document you parse). Then use that $, which is now in scope inside the function, to call what you want on it, without any need for eval. For example:
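// assumes cheerio is already imported, and that this runs where
// top-level await is allowed (e.g. an ES module or an async function)
const sites = {
  "site1": {
    "urls": [ // make sure this is an array, not an object
      "http://site1.com/pageA",
      "http://site1.com/pageB",
    ],
    "queries": {
      "h1": $ => $('h1').text(), // make sure to include comma after value
      // this is a regex literal, not a JSON string, so use a single backslash
      "numbersFromH1": $ => $('h1').text().match(/\d+/)[0]
    }
  }
};
const { site1 } = sites;
// replace with whatever implementation you have that retrieves page text
const pageText = await getPageText(site1.urls[0]);
// $ is created per document, so load it once per page
const firstUrl$ = cheerio.load(pageText);
console.log(site1.queries.h1(firstUrl$));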
This should be easy to adapt to loops and to more dynamic calls over the URLs and queries.
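As a rough sketch of that adaptation, the helpers below run every query function against each loaded page and collect the results by name. The names runQueries and scrapeSite are made up for illustration, getPageText stands in for whatever fetching code you already have, and recording null for a query that throws is an assumption about how you would want a failed .match() handled:

```javascript
// Hypothetical helper: run each query function against one loaded document.
// `$` is whatever cheerio.load() returned for that page.
function runQueries($, queries) {
  const results = {};
  for (const [name, query] of Object.entries(queries)) {
    try {
      results[name] = query($);
    } catch (err) {
      // Assumption: a query that throws (e.g. .match() found nothing,
      // so [0] is read from null) is recorded as null instead of aborting.
      results[name] = null;
    }
  }
  return results;
}

// Hypothetical driver: `getPageText` and `load` are passed in, so the
// caller decides how pages are fetched and parsed.
async function scrapeSite(site, getPageText, load) {
  const byUrl = {};
  for (const url of site.urls) {
    const $ = load(await getPageText(url)); // a fresh $ for every page
    byUrl[url] = runQueries($, site.queries);
  }
  return byUrl;
}
```

Called as something like `await scrapeSite(sites.site1, getPageText, html => cheerio.load(html))`, this would return an object keyed by URL, each entry holding one value per query name.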