I'm building a web scraper that, let's say, scrapes URLs from Google.
I get an array of URLs from the Google results:
const linkSelector = 'div.yuRUbf > a'
let links = await page.$$eval(linkSelector, link => {
return link.map( x => x.href)
})
The output of 'links' looks something like this:
[
'https://google.com/.../antyhing',
'https://amazon.com/.../antyhing',
'https://twitter.com/.../antyhing'
]
Now I have a 'blacklist' that looks something like this:
[
'https://amazon.com'
]
At the moment I'm stuck at the point where I need to compare both arrays and remove from 'links' the URLs that are listed in my blacklist.
So I came up with the idea of extracting the origin of each URL in my links array, like so:
const linkList = []
for ( const link of links ) {
const url = new URL(link)
const domain = url.origin
linkList.push(domain)
}
Now I have two arrays that I can compare against each other to remove the blacklisted domains, but I've lost the complete URLs I need to work with...
for( let i = linkList.length - 1; i >= 0; i--){
for( let j = 0; j < blacklist.length; j++ ){
if( linkList[i] === blacklist[j]){
linkList.splice(i, 1);
}
}
}
The code snippet is taken from the answer given here: Compare two Javascript Arrays and remove Duplicates
Any ideas how I can do this with Puppeteer and Node.js?
CodePudding user response:
I couldn't find an obvious dupe, so converting my comments to an answer:
Using .includes:
const allowedLinks = links.filter(link => !blacklist.some(e => link.includes(e)))
Using .startsWith:
const allowedLinks = links.filter(link => !blacklist.some(e => link.startsWith(e)))
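To see where the two filters can diverge, here is a small sketch. The sample URLs and the names sampleLinks, sampleBlacklist, viaIncludes and viaStartsWith are made up for illustration; the second URL only mentions the blacklisted origin inside its query string.

```javascript
// Hypothetical sample data, not taken from the question.
const sampleLinks = [
  "https://amazon.com/some/page",
  "https://example.com/?ref=https://amazon.com",
];
const sampleBlacklist = ["https://amazon.com"];

// .includes matches the blacklisted string anywhere in the link.
const viaIncludes = sampleLinks.filter(
  link => !sampleBlacklist.some(e => link.includes(e))
);

// .startsWith only matches when the link begins with the blacklisted string.
const viaStartsWith = sampleLinks.filter(
  link => !sampleBlacklist.some(e => link.startsWith(e))
);

console.log(viaIncludes);   // []
console.log(viaStartsWith); // [ 'https://example.com/?ref=https://amazon.com' ]
```

With .includes, the example.com link is dropped as well, even though it isn't hosted on the blacklisted site.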
The second version is more precise. If you want to use the URL version, this should work:
const links = [
"https://google.com/.../antyhing",
"https://amazon.com/.../antyhing",
"https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];
const allowedLinks = links.filter(link =>
!blacklist.some(black =>
black.startsWith(new URL(link).origin) // or use ===
)
);
console.log(allowedLinks);
As for Puppeteer, I doubt it matters whether you do this Node-side or browser-side, unless these arrays are enormous. On that train of thought, technically we have a quadratic algorithm here, but I wouldn't worry about it unless you have many hundreds of thousands of elements and are noticing slowness. In that case, you can put the blacklisted origins into a Set and look up each link's origin in that. The problem with this is that it's a precise === match, so you'd have to build a prefix set if you need to preserve .startsWith semantics. This is likely unnecessary and out of scope for this answer, but worth mentioning briefly.
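For completeness, a minimal sketch of the Set idea might look like the following. It assumes every blacklist entry and every link is a valid absolute URL (new URL throws otherwise) and that exact origin matching is acceptable; the names candidateLinks, blockedOrigins and allowedByOrigin are only illustrative.

```javascript
// Sample data mirroring the example earlier in this answer.
const candidateLinks = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];

// Normalize each blacklist entry to its origin and store it in a Set
// so that each lookup is a single hash check rather than a scan.
const blockedOrigins = new Set(
  ["https://amazon.com"].map(entry => new URL(entry).origin)
);

// Keep the full URL, but decide based on its origin only.
const allowedByOrigin = candidateLinks.filter(
  link => !blockedOrigins.has(new URL(link).origin)
);

console.log(allowedByOrigin);
```

Because the lookup is an exact match on the origin, a link on a different subdomain such as https://www.amazon.com would not be filtered unless that origin is added to the Set as well.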