I can't extract a link from HTML code with Python and BeautifulSoup (beginner)

Time:12-10

I'm a complete beginner with web scraping and programming in Python. The answer might be somewhere on the forum, but I'm so new that I don't really know what to look for. So I hope you can help me:

Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.

I'm trying to scrape a specific link from a website so that I can later create a loop that extracts all the other links. But I can't seem to extract any of the links, even though they are visible in the HTML code.

The link I'm trying to extract is located in this HTML code:

<a href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke">Læs mere</a>

Here is the Python code that I've tried so far:

import requests
from bs4 import BeautifulSoup

url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"

r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a")
len(a_tags)  # there are 34

I've then tried going through all the a-tags from index 0 to 33 without finding the link.

If I print a_tags[26], I get this code:

<a aria-current="page" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">Afgørelser fra Styrelsen for Patientklager</a>

This is somewhere at the top of the website. But the next one, a_tags[27], is from the bottom of the site:

<a href="https://www.linkedin.com/company/styrelsen-for-patientklager/" rel="noopener" target="_blank" title="https://www.linkedin.com/company/styrelsen-for-patientklager/"><span>Linkedin profil</span></a>


Can anyone help me by telling me how to access the specific part of the HTML code that contains the link?

Once I find out how to pull out the link, my plan is to write the following code:

path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"https://stpk.dk{path}"
print(full_url)

Answer:

You will not find what you are looking for, because requests does not render websites the way a browser does. But no worries, there is an alternative.
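One quick way to confirm this is to check whether the link's path occurs anywhere in the raw text that requests returns. The following is only a minimal sketch: the helper name `appears_in_static_html` is made up for illustration, and `static_html` is a stand-in for `r.text` (here just the a-tag from the question that was actually found), so the snippet runs without a network call.

```python
def appears_in_static_html(html: str, path: str) -> bool:
    """Return True if `path` occurs anywhere in the raw HTML text."""
    return path in html


# Stand-in for r.text: only the navigation link that requests did return.
static_html = (
    '<a aria-current="page" '
    'href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">'
    "Afgørelser fra Styrelsen for Patientklager</a>"
)

# The link the question is looking for.
target = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"

found = appears_in_static_html(static_html, target)
print(found)  # False: the target path is not part of the static HTML
```

If the same check against the real `r.text` also returns False, the link is added to the page after the initial HTML has loaded, and BeautifulSoup never gets to see it.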

The content is loaded dynamically via an API, so you should call that API directly; you will get JSON that contains the displayed information.

To find such endpoints, take a closer look at the developer tools of your browser and check the Network tab for XHR requests. It may take a minute to read up on the topic: https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)

Simply iterate over the items, extract the url value and prepend the base_url.

Check and adjust the following parameters to your needs:

containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query: 
year: 
category: 
legalTheme: 
specialty: 
profession: 
treatmentPlace: 
critiqueType: 
take: 200
skip: 0
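The parameters above can also be kept in a dictionary and encoded into the query string, which makes it easier to change a single value. This is a rough sketch only: `build_url` and `page_urls` are hypothetical helper names, and treating `take` and `skip` as page size and offset is an assumption based on their names, not on documented API behaviour.

```python
from urllib.parse import urlencode

API_URL = "https://stpk.dk/api/verdicts/settlements/"

# Default query parameters, in the order shown in the list above.
DEFAULT_PARAMS = {
    "containerKey": "a76f4a50-6106-4128-bc09-a1da7695902b",
    "query": "",
    "year": "",
    "category": "",
    "legalTheme": "",
    "specialty": "",
    "profession": "",
    "treatmentPlace": "",
    "critiqueType": "",
    "take": 200,
    "skip": 0,
}


def build_url(**overrides):
    """Return the API URL with the default parameters, overridden as given."""
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULT_PARAMS, **overrides}
    return f"{API_URL}?{urlencode(params)}"


def page_urls(pages, take=200):
    """Return one URL per page, assuming `skip` is an offset into the results."""
    return [build_url(take=take, skip=i * take) for i in range(pages)]


print(build_url())
for page_url in page_urls(2):
    print(page_url)
```

Each resulting URL can be passed to requests.get(...) as in the example. Whether a filter value such as a year is accepted in a given format has to be checked in the browser's network tab.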

Example

import requests

url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'
for e in requests.get(url).json()['items']:
    print(base_url + e['url'])

Output

https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...
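As an alternative to string concatenation, the links can be collected into a list with urljoin from the standard library. This is a sketch under assumptions: `extract_links` is a hypothetical helper, and `sample_payload` is a hand-written stand-in for the JSON response that only mirrors the `items` and `url` keys used in the example; the real response may contain more fields.

```python
from urllib.parse import urljoin

BASE_URL = "https://stpk.dk"


def extract_links(payload, base_url=BASE_URL):
    """Return absolute URLs for every item in the payload that has a 'url'."""
    return [
        urljoin(base_url, item["url"])
        for item in payload.get("items", [])
        if item.get("url")
    ]


# Hand-written stand-in for requests.get(url).json(); not real API output.
sample_payload = {
    "items": [
        {"url": "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"},
        {"url": "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/"},
        {"url": ""},  # items without a usable url are skipped
    ]
}

links = extract_links(sample_payload)
for link in links:
    print(link)
```

With the real response, the call would be extract_links(requests.get(url).json()), and the resulting list can then be looped over to fetch each decision page.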