I'm trying to crawl a web page using Node and cheerio. Everything is returned as I expect except for the hrefs.
I successfully get values for 'headers' with .find('h3').text()
and for 'descriptions' with .find('a').text()
but for 'links', .find('a').attr('href')
returns only the first one. This confuses me, as the 'descriptions' text is within the same anchors.
I've found that if I remove the .attr('href')
and just return .find('a')
the link text (href) is displayed as expected. I can modify the returned value to make this work if need be, but I would prefer to do this correctly.
Script:
const cheerio = require("cheerio");
const axios = require("axios");

axios.get("http://localhost:8000/sample_page_2.html").then(urlResponse => {
  const $ = cheerio.load(urlResponse.data);

  // Visit each section block on the page
  $('div.tos-post-type').each((i, element) => {
    const header = $(element)
      .find('h3')
      .text()
      .trim();
    console.log('------------------------------------------------------------------------------------');
    console.log('HEADER: ' + header);

    // This is the call that returns only the first href
    const link = $(element)
      .find('a')
      .attr('href');
    console.log('\nLINK(s): \n' + link);

    const description = $(element)
      .find('a')
      .text();
    console.log('\nDESCRIPTION(s): \n' + description + '\n');
    console.log('------------------------------------------------------------------------------------');
  });
});
Here is a snippet of the page I'm trying to crawl:
<div class="container tos-archive">
<div class="row justify-content-center">
<div class="col-lg-10">
<div class="row">
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/legal.svg )"></div>
<h3>
Legal </h3>
<a href="https://www.example_domain.com/legal/terms-conditions/">
Terms & Conditions </a>
<a href="https://www.example_domain.com/legal/service-providers/">
Service Providers </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/policy.svg )"></div>
<h3>
Policies </h3>
<a target="" href="https://www.example_domain.com/privacy-policy/">
Privacy Policy </a>
<a target="" href="https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage">
Cookie Policy </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/clip-dark.svg )"></div>
<h3>
<a href="https://www.example_domain.com/compliance/">
Compliance </a>
</h3>
<a href="https://www.example_domain.com/compliance/ccpa/">
California Consumer Privacy Act (CCPA) </a>
<a href="https://www.example_domain.com/compliance/disaster-recovery/">
Disaster Recovery </a>
<a href="https://www.example_domain.com/compliance/gdpr/">
GDPR </a>
<a href="https://www.example_domain.com/compliance/pci-dss/">
PCI DSS </a>
<a href="https://www.example_domain.com/compliance/privacymark/">
PrivacyMark </a>
<a class="tos-view-all" href="https://www.example_domain.com/compliance/">
View All </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/mouse.svg )"></div>
<h3>
Other </h3>
<a href="https://www.example_domain.com/legal-other/eu-standard-solutions/">
EU Standard Solutions </a>
<a href="https://www.example_domain.com/legal-other/eu-standard-service-providers/">
EU Standard Service Providers </a>
<a href="https://www.example_domain.com/legal-other/data-exhibit/">
Data Exhibit </a>
<a href="https://www.example_domain.com/legal-other/data-standards/">
Data Standards </a>
<a href="https://www.example_domain.com/legal-other/payment-addenda/">
Payment Addenda </a>
</div>
</div>
</div>
</div>
</div>
</div>
Here's a snippet of actual results:
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies
LINK(s):
https://www.example_domain.com/privacy-policy/
DESCRIPTION(s):
Privacy Policy
Cookie Policy
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance
LINK(s):
https://www.example_domain.com/compliance/
DESCRIPTION(s):
Compliance
California Consumer Privacy Act (CCPA)
Disaster Recovery
GDPR
PCI DSS
PrivacyMark
View All
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
Here is what I am expecting (multiple links):
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies
LINK(s):
https://www.example_domain.com/privacy-policy/
https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage
DESCRIPTION(s):
Privacy Policy
Cookie Policy
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance
LINK(s):
https://www.example_domain.com/compliance/
https://www.example_domain.com/compliance/ccpa/
https://www.example_domain.com/compliance/disaster-recovery/
https://www.example_domain.com/compliance/gdpr/
https://www.example_domain.com/compliance/pci-dss/
https://www.example_domain.com/compliance/privacymark/
https://www.example_domain.com/compliance/
DESCRIPTION(s):
Compliance
California Consumer Privacy Act (CCPA)
Disaster Recovery
GDPR
PCI DSS
PrivacyMark
View All
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
Any ideas what I'm doing incorrectly?
Thanks!
Answer:
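.attr('href') returns the attribute of only the first matched element, whereas .text() concatenates the text of every matched element, which is why all the descriptions appear but only one link does. Use map to get the attribute from each element: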
$(element).find('a').get().map(a => $(a).attr('href'))