BeautifulSoup getting href inside h-tag

Time:10-07

Good day! I only need the href="this-value" of the link inside the h4 block. The problem is that the a tag has no class or id. This is what the block looks like in the HTML:

<h4  itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>

Python Code:

page = requests.get(product_fetch_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

product_fetch_url_class = "article_title_list"
product_fetch_url_html = "h4"

find_urls = soup.find_all('{0}'.format(product_fetch_url_html), class_='{0}'.format(product_fetch_url_class))

for row in find_urls:
    string = row
    print("Produkt: {0}".format(string))

    html = BeautifulSoup(string, "html.parser")
    
    for a in html.find('a', href=True):
        print("Produkt URL-Slug: {0}".format(a['href']))

Output:

Produkt: <h4  itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>

Traceback (most recent call last):
File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 45, in <module>
fetch_urls()
File "/usr/share/nginx/html/mp-masterdb/pokefri.de/scraper.py", line 38, in fetch_urls
html = BeautifulSoup(string, "html.parser")
File "/usr/lib/python3.10/site-packages/bs4/__init__.py", line 312, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable

Expected output:

Produkt: <h4  itemprop="name"><a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4> 

Produkt Url-slug: 10-deutsche-pokemon-karten-sparpack

Any ideas how to solve this directly with BeautifulSoup instead of falling back to re/regex?

Answer:

If you only want to fetch the links, select your elements more specifically, for example with a CSS selector:

for a in soup.select('h4>a'):
    print(a.get('href'))
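
As for the error in the question: the traceback suggests that BeautifulSoup(string, "html.parser") was handed a Tag rather than a markup string, and Beautiful Soup then tried to treat it as a file-like object. Each row returned by find_all() is already a Tag, so it can be searched directly without parsing it again. Note also that find() returns a single Tag (or None), so it should not be looped over. A minimal sketch, with the h4 block from the question inlined so that it runs without a network request (the slugs list is only there for illustration):

```python
from bs4 import BeautifulSoup

# The h4 block from the question, inlined so the sketch runs offline.
html = """
<h4  itemprop="name">
<a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten - mit Rare oder Holo/EX/GX - wie ein Booster!</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")

slugs = []  # illustrative name, not part of the original code
for row in soup.find_all("h4", itemprop="name"):
    # row is already a bs4 Tag, so search it directly
    # instead of calling BeautifulSoup(row, ...) a second time.
    a = row.find("a", href=True)  # one Tag or None, not a list
    if a is not None:
        slugs.append(a["href"])
        print("Produkt URL-Slug: {0}".format(a["href"]))
```

This filters on itemprop="name" because the h4 shown in the question carries that attribute; adjust the filter if the real page marks its headings differently.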

Or, if you would rather go row by row:

for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))
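
One caveat with the row-by-row version: e.h4 is None for any row that has no h4, and then e.h4.a raises an AttributeError. A hedged variant that skips such rows is sketched below. The inline HTML is hypothetical; it only mirrors the #product-list > div nesting assumed by the selector, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the selector above; the second row
# deliberately has no h4 to show the guard at work.
html = """
<div id="product-list">
  <div><h4 itemprop="name"><a href="10-deutsche-pokemon-karten-sparpack">10 deutsche Pokemon Karten</a></h4></div>
  <div><p>Banner without a product heading</p></div>
  <div><h4 itemprop="name"><a href="Glaenzendes-Schicksal-Booster-Deutsch">Glaenzendes Schicksal Booster</a></h4></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for e in soup.select("#product-list > div"):
    # select_one() returns None when the row has no matching link
    a = e.select_one("h4 > a[href]")
    if a is None:
        continue  # skip rows without a product link
    hrefs.append(a["href"])

print(hrefs)
```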

Example

import requests
from bs4 import BeautifulSoup

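# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get('https://www.lotticards.de/pokemon-sammelkarten').text, 'html.parser')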

for e in soup.select('#product-list > div'):
    print(e.h4.a.get('href'))

Output

10-deutsche-pokemon-karten-sparpack
Glaenzendes-Schicksal-Booster-Deutsch
Pokemon-Celebrations-Booster-Packung-Deutsch
Pikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch
Verborgenes-Schicksal-Top-Trainer-Box
Sun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch
Champions-Path-Elite-Trainer-Box-Englisch
Glaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch
...

Or as a list comprehension, based on the elements with itemprop="url":

[a.get('content') for a in soup.select('#product-list [itemprop="url"]')]

Output:

['https://www.lotticards.de10-deutsche-pokemon-karten-sparpack',
 'https://www.lotticards.deGlaenzendes-Schicksal-Booster-Deutsch',
 'https://www.lotticards.dePokemon-Celebrations-Booster-Packung-Deutsch',
 'https://www.lotticards.dePikachu-V-Kollektion-Glaenzendes-Schicksal-Deutsch',
 'https://www.lotticards.deVerborgenes-Schicksal-Top-Trainer-Box',
 'https://www.lotticards.deSun-Moon-Tag-Team-All-Stars-GX-High-Class-Pack-SM12a-Display-Japanisch',
 'https://www.lotticards.deChampions-Path-Elite-Trainer-Box-Englisch',
 'https://www.lotticards.deGlaenzendes-Schicksal-Mini-Tin-Set-Alle-5-Motive-Deutsch',
 'https://www.lotticards.deShining-Fates-Elite-Trainer-Box-Englisch',
 'https://www.lotticards.deHidden-Fates-Elite-Trainer-Box-Reprint-Januar-2021',
 'https://www.lotticards.deVMAX-Climax-s8b-Display-Japanisch',
 'https://www.lotticards.deSonne-Mond-Ultra-Prisma-Booster-Deutsch',
 'https://www.lotticards.desonne-mond-2-stunde-der-waechter-booster-deutsch-kaufen',
 'https://www.lotticards.deSchwert-Schild-Kampfstile-Display-Deutsch',
 'https://www.lotticards.dePokemon-Celebrations-Booster-Pack-Englisch',...]
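
In the output above, the host and the slug run together with no separator, which appears to be how the page fills its content attribute. If absolute URLs are needed, one option is to join the href slugs onto a base URL with urllib.parse.urljoin. This is only a sketch: BASE_URL is an assumption, and whether the products are actually served directly under the site root has not been checked.

```python
from urllib.parse import urljoin

# Assumed base URL; confirm against the real site before relying on it.
BASE_URL = "https://www.lotticards.de/"

# Two of the href slugs from the earlier output.
slugs = [
    "10-deutsche-pokemon-karten-sparpack",
    "Glaenzendes-Schicksal-Booster-Deutsch",
]

# urljoin resolves each relative slug against the base URL.
urls = [urljoin(BASE_URL, slug) for slug in slugs]
print(urls)
```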