Crawling text from website with Python Requests and BeautifulSoup fails

Time:10-05

I would like to read some job ads automatically. For this, I implemented the procedure below which works quite well for most web pages:

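import requests
from bs4 import BeautifulSoup

def getTextFromWeb(url):
    website = requests.get(url)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(website.content, 'html.parser')
    # Collect every text node in the document
    temp = soup.find_all(string=True)
    xvec = []
    for x in temp:
        # Skip empty and single-character nodes
        if len(x) > 1:
            xvec.append(x)
    text = '\n'.join(xvec)
    return text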

However, I'm not able to read in the relevant text for a web page like this one:

https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst

Any ideas how to enhance the procedure above in order to be able to import this text? Thanks a lot!

Answer:

The data is inside a <script> tag in the page's HTML source, stored as JSON (a JSON-LD block). Find that tag and parse its contents as JSON:

from bs4 import BeautifulSoup 
import requests 
import json

url = 'https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
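# The job ad is embedded as JSON-LD; the MIME type is written with a '+'
script = soup.find('script', {'type': 'application/ld+json'})

# script.string is the raw JSON text inside the tag
jsonData = json.loads(script.string)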
print(jsonData['description'])
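
If a page has several JSON-LD blocks, or none at all, the snippet above will raise an error or pick the wrong block. A slightly more defensive version could look like the sketch below. It is only a sketch under some assumptions: the function name extract_job_description and the sample_html string are made up for illustration, and it assumes the job ad is marked with "@type": "JobPosting" (the schema.org convention), which I have not verified for this particular site. The description value often contains HTML itself, so it is passed through BeautifulSoup once more to get plain text.

```python
import json
from bs4 import BeautifulSoup


def extract_job_description(html):
    """Return the plain-text description of the first JobPosting
    found in the page's JSON-LD blocks, or None if there is none.
    (Hypothetical helper, not part of any library.)"""
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script', type='application/ld+json'):
        raw = script.string
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON
            continue
        # A block may hold a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            # Assumption: the ad is tagged as a schema.org JobPosting
            if (isinstance(item, dict)
                    and item.get('@type') == 'JobPosting'
                    and 'description' in item):
                # The description is often HTML; reduce it to text
                inner = BeautifulSoup(item['description'], 'html.parser')
                return inner.get_text('\n', strip=True)
    return None


# Made-up sample page so the function can be tried without a network call
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "JobPosting", "title": "Regionalleiter", "description": "<p>Lead the region.</p><ul><li>Plan routes</li></ul>"}</script>
</head><body></body></html>
"""

print(extract_job_description(sample_html))
```

For a live page you would pass requests.get(url).text instead of sample_html and fall back to the generic text extraction when the function returns None.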