How do I scrape the name/text of the hyperlinks using python?

Time: 03-09

I want to extract the text of the links from this URL: https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html, but I can't figure out how to move on to the next step. Below is my code so far:

import requests as re
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = re.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="post altr")

for result in results:
    print(result)

I still don't know how to take the next step. Any help is much appreciated. Thank you.

CodePudding user response:

This code prints the text of every link on the page:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# Every hyperlink is an <a> tag, so find_all('a') collects all of them.
results = soup.find_all('a')

for result in results:
    print(result.text.strip())

Output:

CCDA
port channels
RPVST
Dynamic Trunking Protocol
VTP transparent mode
Layer 3 load balancing
user ports
enable PortFast
the core layer
link redundancy
access layer switches
Gateway Load Balancing Protocol
core switches
distribution switches
redundant paths
campus core
Large Building LANs
LAN Design Types and Models
Shutting Down a BGP Neighbor
Core Layer Functionality - Network Design
Distribution Layer Functionality
Characterizing Types of Traffic Flow for New Network Applications
DHCP Starvation and Spoofing Attacks
How to Start an Ecommerce Business
Reply
About
Contact
Advertise
Privacy Policy
Resources
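Note that this includes navigation and footer links. If you only want the links inside the article body (the `post altr` container from your original code), you can scope the search to that container first. A minimal sketch on an inline HTML snippet standing in for the real page (in the actual script you'd build `soup` from `page.content` as above):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page: one article container
# plus an unrelated navigation link that should be skipped.
html = """
<div class="menu"><a href="/about/">About</a></div>
<div class="post altr">
  <p>See <a href="/ccda/">CCDA</a> and
  <a href="/root-bridge/vtp-modes.html">VTP transparent mode</a>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the article container first, then collect only the <a> tags
# inside it -- links elsewhere on the page are ignored.
post = soup.find(class_="post altr")
texts = [a.text.strip() for a in post.find_all("a")]
print(texts)  # ['CCDA', 'VTP transparent mode']
```

Passing the exact string "post altr" to `class_` matches elements whose class attribute is exactly that string, which is why `find` locates the container.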

It works because, in HTML, hyperlinks are created with the <a> tag. I believe you're asking for the text of the links, but if you want the URLs themselves, here's how you can do it:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# href=True skips any <a> tags that have no href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])

Output:

/
/reviews/traffic-xtractor.html



/ccda/
/routing-switching/using-routed-ports-and-portchannels-with-mls.html
/root-bridge/rapid-pervlan-spanning-tree-protocol.html
/network-security-2/dynamic-trunking-protocol-dtp.html
/root-bridge/vtp-modes.html
/root-bridge/configuring-etherchannel-load-balancing.html
/routing-switching-2/switch-security-best-practices-for-unused-and-user-ports.html
/global-configuration/enabling-bpdu-guard.html
/network-design/core-layer-functionality.html
/network-design/designing-link-redundancy.html
/network-design/access-layer-functionality.html
/root-bridge/gateway-load-balancing-protocol.html
/switching/collapsed-core.html
/switching/distribution-layer-switches.html
/switching/backbonefast-redundant-backbone-paths.html
/network-design/campus-core-design-considerations.html
/ccda/largebuilding-lans.html
/ccda/lan-design-types-and-models.html
/cisco-internetworks-2/shutting-down-a-bgp-neighbor.html
/network-design/core-layer-functionality.html
/network-design/distribution-layer-functionality.html
/network-design-2/characterizing-types-of-traffic-flow-for-new-network-applications.html
/snrs-3/dhcp-starvation-and-spoofing-attacks.html
/ecommerce.html
/about/
/contact/
/advertise-with-us/
/privacy-policy/
/resources/

This prints just the href attribute of each <a> tag.
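Since most of these hrefs are relative (they start with /), you may also want the full URLs. `urllib.parse.urljoin` resolves a relative href against the page URL. A sketch on an inline snippet (in the real script, build `soup` from `page.content` as before):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://www.ccexpert.us/ccda/best-practices-for-hierarchical-layers.html"

# Inline stand-in for the fetched page content.
html = '<a href="/ccda/">CCDA</a> <a href="/about/">About</a>'
soup = BeautifulSoup(html, "html.parser")

# Pair each link's text with its absolute URL; urljoin resolves a
# relative href like "/ccda/" against the page it appeared on.
links = [(a.text.strip(), urljoin(BASE, a["href"]))
         for a in soup.find_all("a", href=True)]

for text, url in links:
    print(f"{text}: {url}")
```

Output:

CCDA: https://www.ccexpert.us/ccda/
About: https://www.ccexpert.us/about/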
