Extract links using cheerio (with puppeteer)

Time:11-23

I am using puppeteer and cheerio and am new to both. Here is the relevant snippet of the page's HTML source:

<section class="descr">
  <div class="center">
    <a class="mfp-image" href="https://site.pics/store/1234/cat/img.jpg" title="Full size: 642x642" target="_blank"><img class="lazy 123" src="/assets/images/blank.gif" data-src="https://site.pics/store/1234/cat/th_img.jpg" alt="Image"></a>
  </div>
  <div class="info">JPG | 500px | 1MB 22.11.2021</div>
  <hr id='more-3948099'>
  <br>
  <div class="blockSpoiler dl-links"><span class="fixHeader" id="download-links"></span><i class="sa sa-download-spoiler pl1em"></i><span class="blockTitle pl0">Get from file storage </span></div>
  <div class="blockSpoiler-content txtleft c-dl-links"><a rel="external nofollow noopener" href="https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html" target="_blank">HOST1</a>
    <br><a rel="external nofollow noopener" href="https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin" target="_blank">HOST2</a>
    <br><a rel="external nofollow noopener" href="http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin" target="_blank">HOST3</a>
    <br><a rel="external nofollow noopener" href="https://www.link4.com/riwtuwz9vjr3" target="_blank">HOST4</a>
    <br>
  </div>
<iframe name="sif1" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>

I need to get these links:

  1. https://site.pics/store/1234/cat/img.jpg
  2. https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html
  3. https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin
  4. http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin
  5. https://www.link4.com/riwtuwz9vjr3

Note that some pages also have a fifth download link (not shown in this example).

I used this code in the Chrome Developer tools:

document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML

document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML

Both return a lot of markup that contains the links I need, along with unwanted text. I have spent several hours on this without making any further progress.

When I write the equivalent code using cheerio, I do not get any useful output:

const html = await page.content();
const $ = cheerio.load(html);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links"));
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML);

Any help is appreciated.

CodePudding user response:

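Cheerio returns a jQuery-style wrapper rather than DOM elements, so properties such as innerHTML and outerHTML are undefined on it. Read the href attribute of each anchor instead: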

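const cheerio = require('cheerio');
const html = await page.content(); // `page` is the puppeteer Page from the question
const $ = cheerio.load(html);
// Collect the href attribute of every anchor on the page that has one.
const urls = $('a[href]').map(function () { return $(this).attr('href'); }).toArray();
console.log('urls', urls);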
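
To get only the image link and the download links rather than every anchor on the page, the selector can be narrowed. The following is a minimal sketch, assuming the `mfp-image` and `c-dl-links` classes from the HTML in the question are stable across pages; the function name `extractLinksWithCheerio` is made up for illustration.

```javascript
// Hypothetical helper: `$` is a cheerio instance returned by cheerio.load(html).
// The selector is taken from the class names in the question's HTML snippet.
function extractLinksWithCheerio($) {
  return $('a.mfp-image, div.c-dl-links a[href]')
    .map((i, el) => $(el).attr('href')) // read the raw href attribute
    .get();                             // unwrap into a plain array of strings
}
```

It would be called as extractLinksWithCheerio(cheerio.load(await page.content())). Because it selects every anchor inside the download block, an extra fifth link would be picked up without any change.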

CodePudding user response:

Since the page is already loaded in puppeteer, it is simpler to skip cheerio and read the links directly in the browser context:

let urls = await page.$$eval('a', as => as.map(a => a.href))
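
The same call can be limited to the links asked for in the question. This is a sketch under the assumption that the `mfp-image` and `c-dl-links` classes shown in the question's HTML are present on every page; `LINK_SELECTOR`, `toHrefs`, and `extractLinksFromPage` are illustrative names, not part of puppeteer.

```javascript
// Selector based on the class names in the question's HTML snippet (an assumption).
const LINK_SELECTOR = 'a.mfp-image, div.c-dl-links a[href]';

// Executed inside the browser by puppeteer: map matched <a> elements to their URLs.
// It must not reference any variables from the Node side, since it is serialized.
const toHrefs = (anchors) => anchors.map((a) => a.href);

// `page` is an already-navigated puppeteer Page.
async function extractLinksFromPage(page) {
  return page.$$eval(LINK_SELECTOR, toHrefs);
}
```

Usage would be const urls = await extractLinksFromPage(page). Note that a.href in the browser returns the resolved absolute URL, whereas the raw attribute value may be relative.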