I would like to read some job ads automatically. For this, I implemented the function below, which works quite well for most web pages:
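import requests
from bs4 import BeautifulSoup

def getTextFromWeb(url):
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    # Collect every text node in the document
    temp = soup.find_all(string=True)
    xvec = []
    for x in temp:
        # Skip empty or single-character strings
        if len(x) > 1:
            xvec.append(x)
    text = '\n'.join(xvec)
    return text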
However, I'm not able to read in the relevant text for a web page like this one:
https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst
Any ideas how to enhance the procedure above in order to be able to import this text? Thanks a lot!
CodePudding user response:
The data is inside a <script> tag in the page's HTML source, stored as JSON. You need to parse it from there:
from bs4 import BeautifulSoup
import requests
import json
url = 'https://jobs.swp.de/jobs/4775408/Regionalleiter_(m_w_d)_Donaueschingen___Freudenst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
script = soup.find('script', {'type':'application/ld+json'})
jsonData = json.loads(script.text)
print(jsonData['description'])
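If the same approach should work across several pages, a slightly more defensive variant may help. The sketch below is only an illustration: the function name extract_job_description and the sample_html string are made up for this example, and it assumes that the JSON-LD block may be missing, may hold a list instead of a single object, and that the description field may itself contain HTML markup.

```python
import json
from bs4 import BeautifulSoup

def extract_job_description(html):
    """Return the plain-text 'description' from the first JSON-LD block that has one, else None."""
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script', {'type': 'application/ld+json'}):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            # Skip blocks that are empty or not valid JSON
            continue
        # Assumption: a JSON-LD block can be a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and 'description' in item:
                # Assumption: the description may contain HTML, so strip the tags
                inner = BeautifulSoup(item['description'], 'html.parser')
                return inner.get_text('\n', strip=True)
    return None

# Hypothetical sample page, used here instead of a live request
sample_html = '''<html><head><script type="application/ld+json">
{"@type": "JobPosting", "title": "Regionalleiter", "description": "<p>First line</p><p>Second line</p>"}
</script></head><body></body></html>'''

print(extract_job_description(sample_html))
```

For a live page, pass requests.get(url).text to the function instead of sample_html. A return value of None means no JSON-LD block with a description was found, in which case falling back to getTextFromWeb is an option.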