I am using Puppeteer and cheerio and am new to both. Here is the relevant snippet of the page's HTML source:
<section class="descr">
<div class="center">
<a class="mfp-image" href="https://site.pics/store/1234/cat/img.jpg" title="Full size: 642x642" target="_blank"><img class="lazy 123" src="/assets/images/blank.gif" data-src="https://site.pics/store/1234/cat/th_img.jpg" alt="Image"></a>
</div>
<div class="info">JPG | 500px | 1MB 22.11.2021</div>
<hr id='more-3948099'>
<br>
<div class="blockSpoiler dl-links"><span class="fixHeader" id="download-links"></span><i class="sa sa-download-spoiler pl1em"></i><span class="blockTitle pl0">Get from file storage </span></div>
<div class="blockSpoiler-content txtleft c-dl-links"><a rel="external nofollow noopener" href="https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html" target="_blank">HOST1</a>
<br><a rel="external nofollow noopener" href="https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin" target="_blank">HOST2</a>
<br><a rel="external nofollow noopener" href="http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin" target="_blank">HOST3</a>
<br><a rel="external nofollow noopener" href="https://www.link4.com/riwtuwz9vjr3" target="_blank">HOST4</a>
<br>
</div>
<iframe name="sif1" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>
I need to get these links:
- https://site.pics/store/1234/cat/img.jpg
- https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html
- https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin
- http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin
- https://www.link4.com/riwtuwz9vjr3
Please note that in some cases there may also be a link5 (not shown here).
I ran this code in the Chrome DevTools console:
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML
This returns a lot of markup that includes the links I need, along with unwanted text. I have been trying for several hours but have not been able to make any more progress.
When I write the equivalent code using cheerio, I do not get any useful output:
const html = await page.content();
const $ = cheerio.load(html);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links"));
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML);
Any help is appreciated.
CodePudding user response:
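This should help. A cheerio selection is not a DOM element, so it has no innerHTML or outerHTML property (both come back undefined). Read each link's href attribute instead: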
const $ = cheerio.load(html);
var urls = $('a[href]').map(function() {return $(this).attr('href') || '';}).toArray();
console.log('urls', urls);
CodePudding user response:
In this case, though, using Puppeteer directly is better:
let urls = await page.$$eval('a', as => as.map(a => a.href))
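The one-liner above also returns every link on the page. A narrower variant is sketched below, assuming the same a.mfp-image and c-dl-links classes as in the question. pickHrefs and getLinks are hypothetical helper names, and dropping duplicates is an optional extra rather than something the question asks for:

```javascript
// Hypothetical helper. Puppeteer serializes it and runs it inside the page,
// so it must not reference any variables from the surrounding Node.js scope.
function pickHrefs(anchors) {
  const seen = new Set();
  const urls = [];
  for (const a of anchors) {
    // In the browser, a.href is the fully resolved absolute URL.
    if (a.href && !seen.has(a.href)) {
      seen.add(a.href);
      urls.push(a.href);
    }
  }
  return urls;
}

// Hypothetical wrapper: `page` is an already-navigated Puppeteer Page.
async function getLinks(page) {
  return page.$$eval('a.mfp-image, div.c-dl-links a', pickHrefs);
}
```

Calling const urls = await getLinks(page); after page.goto(...) should yield the image link and the host links without the intermediate page.content() and cheerio.load() steps.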