Compare two arrays with Node.js and Puppeteer

I'm building a web scraper that, let's say, scrapes URLs from Google.

I get an array of URLs from the Google results:

const linkSelector = 'div.yuRUbf > a'
let links = await page.$$eval(linkSelector, link => {
     return link.map( x => x.href)
})

The output of 'links' looks something like this:

[
'https://google.com/.../antyhing',
'https://amazon.com/.../antyhing',
'https://twitter.com/.../antyhing'
]

I also have a 'blacklist' that looks something like this:

[
'https://amazon.com'
]

At the moment I'm stuck at the point where I need to compare both arrays and remove from 'links' the URLs that are listed in my blacklist.

So I came up with the idea of extracting the origin of each URL in my links array, like so:

const linkList = []
for (const link of links) {
  const url = new URL(link)
  const domain = url.origin
  linkList.push(domain)
}

Now I have two arrays that I can compare against each other to remove the blacklisted domains, but I've lost the complete URLs I need to work with:

for( let i = linkList.length - 1; i >= 0; i--){
  for( let j=0; j < blacklist.length; j++ ){
    if( linkList[i] === blacklist[j]){
      linkList.splice(i, 1);
    }
  }
}

This code snippet is taken from the answer given here: Compare two Javascript Arrays and remove Duplicates

Any ideas how I can do this with Puppeteer and Node.js?

CodePudding user response：

I couldn't find an obvious dupe, so converting my comments to an answer:

.includes:

const allowedLinks = links.filter(link => !blacklist.some(e => link.includes(e)))

.startsWith:

const allowedLinks = links.filter(link => !blacklist.some(e => link.startsWith(e)))

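The second version is more precise, since it only matches a blacklist entry at the start of the link rather than anywhere inside it. If you'd rather compare parsed origins using URL, this should work: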

const links = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const blacklist = ["https://amazon.com"];

const allowedLinks = links.filter(link =>
  !blacklist.some(black =>
    black.startsWith(new URL(link).origin) // or use ===
  )
);
console.log(allowedLinks);
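
To illustrate the difference between the .includes and .startsWith versions, here is a small sketch. The sample data (sampleLinks, sampleBlacklist) is hypothetical and chosen so that one link only mentions the blacklisted origin inside its query string:

```javascript
// Hypothetical sample data: the second link is not an amazon.com page,
// it merely contains the blacklisted origin in its query string.
const sampleLinks = [
  "https://amazon.com/some/product",
  "https://example.com/?ref=https://amazon.com",
];
const sampleBlacklist = ["https://amazon.com"];

// Substring match: drops any link that contains a blacklist entry anywhere.
const viaIncludes = sampleLinks.filter(
  link => !sampleBlacklist.some(e => link.includes(e))
);

// Prefix match: drops only links that begin with a blacklist entry.
const viaStartsWith = sampleLinks.filter(
  link => !sampleBlacklist.some(e => link.startsWith(e))
);

console.log(viaIncludes);   // empty: both links were removed
console.log(viaStartsWith); // only the example.com link remains
```

Note that a plain prefix match would still treat a host such as https://amazon.com.example.org as blacklisted, which is one reason comparing parsed origins can be the safer choice.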

As for Puppeteer, I doubt it matters whether you do this Node-side or browser-side, unless these arrays are enormous. On that train of thought, the filter is technically quadratic (every link is checked against every blacklist entry), but I wouldn't worry about it unless you have many hundreds of thousands of elements and are noticing slowness. In that case, you can put the blacklisted origins into a Set and look up each link's origin in it. The catch is that a Set lookup is an exact match, like ===, so you'd have to build a prefix set if you need to preserve .startsWith semantics. This is likely unnecessary and out of scope for this answer, but worth mentioning briefly.
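
For completeness, a minimal sketch of the Set idea, assuming exact origin matching is acceptable for your blacklist. The names candidateLinks, rawBlacklist, blockedOrigins and allowedBySet are only illustrative:

```javascript
// Sample data in the same shape as the question's arrays.
const candidateLinks = [
  "https://google.com/.../antyhing",
  "https://amazon.com/.../antyhing",
  "https://twitter.com/.../antyhing",
];
const rawBlacklist = ["https://amazon.com"];

// Normalize each blacklist entry to its origin once, so that entries
// with a trailing slash or a path still match on scheme + host (+ port).
const blockedOrigins = new Set(rawBlacklist.map(entry => new URL(entry).origin));

// One Set lookup per link instead of scanning the whole blacklist.
// Assumes every link is an absolute URL; new URL() throws otherwise.
const allowedBySet = candidateLinks.filter(
  link => !blockedOrigins.has(new URL(link).origin)
);

console.log(allowedBySet);
```

Because the full URL strings are filtered directly, the complete links are preserved; only the comparison uses the origin.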