How to scrape multiple href values?

Time:12-03

Hello, I want to pull the links from this page. My current approach returns all the information in that field, but I only need the links. How can I scrape just the links? (Python, BeautifulSoup)

make_list = base_soup.findAll('div', {'a class': 'link--muted no--text--decoration result-item'})
one_make = make_list.findAll('href')
print(one_make)

The structure to extract the data is as follows:

<div data-testid="no-top">
<a  href="https://link structure" 

Every link I want to collect sits in this position (link structure).

I tried approaches like the one above. Thank you very much in advance for your help.

CodePudding user response:

Note: In new code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.

Iterate over your ResultSet and extract the value of the href attribute from each element:

make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))

Example

from bs4 import BeautifulSoup

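html = '''
<div data-testid="no-top">
    <a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
<div data-testid="no-top">
    <a class="link--muted no--text--decoration result-item" href="https://link structure"></a>
</div>
'''
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')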

make_list = soup.find_all('a', {'class': 'link--muted no--text--decoration result-item'})
for e in make_list:
    print(e.get('href'))
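
The note at the top of this answer mentions select() as an alternative to find_all(). As a rough sketch of what that could look like, the snippet below uses a CSS selector instead of a class dictionary. The example.com URLs and the choice to match on the single class result-item are assumptions made for illustration; adjust the selector to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class names come from the question,
# the example.com URLs are placeholders.
html = '''
<div data-testid="no-top">
    <a class="link--muted no--text--decoration result-item" href="https://example.com/one"></a>
</div>
<div data-testid="no-top">
    <a class="link--muted no--text--decoration result-item" href="https://example.com/two"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# 'a.result-item[href]' matches <a> elements that carry the result-item
# class and also have an href attribute, regardless of class order.
links = [a['href'] for a in soup.select('a.result-item[href]')]
print(links)
```

Unlike the dictionary form, which compares the full class string, the CSS selector matches on one class at a time, so it should keep working if the site reorders or adds classes.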

CodePudding user response:

Here is an example of how you can achieve that:

from bs4 import BeautifulSoup

html = '''
<div data-testid="no-top">
    <a href="https://link structure"></a>
</div>
<div data-testid="no-top">
    <a href="https://link example.2"></a>
</div>
'''

soup = BeautifulSoup(html, features="lxml")
anchors = soup.find_all('a')

for anchor in anchors:
    print(anchor['href'])
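
One caveat with anchor['href']: it raises a KeyError for any <a> tag that has no href attribute. A minimal sketch of one way around that, assuming you simply want to skip such anchors, is to pass href=True to find_all(). The markup and example.com URLs below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '''
<a href="https://example.com/one">one</a>
<a name="no-link">anchor without href</a>
<a href="https://example.com/two">two</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute,
# so indexing with ['href'] cannot fail.
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)
```

Using anchor.get('href') is another option; it returns None for anchors without an href instead of raising.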

Alternatively, you can use a third-party service such as WebScrapingAPI. I recommend it because it is beginner-friendly and offers CSS-based extraction along with advanced features such as IP rotation, JavaScript rendering, CAPTCHA solving, and custom geolocation; the docs list more. This is an example of how you can get the links from a webpage using WebScrapingAPI:

from bs4 import BeautifulSoup
import requests
import json

API_KEY = '<YOUR-API-KEY>'

SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://docs.webscrapingapi.com/'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "extract_rules": '{"linksList": {"selector": "a[href]", "output": "html", "all": 1 }}',
}

response = requests.get(SCRAPER_URL, params=PARAMS)
parsed_result = json.loads(response.text)

linksList = parsed_result['linksList']

for link in linksList:
    soup = BeautifulSoup(link, features='lxml')
    print(soup.find('a').get('href'))

If you are interested, you can find more information about this in our Extraction Rules docs.
