Web scraping of dynamic content with BeautifulSoup


To train my Python skills, I tried to scrape the number of open jobs for a specific job title from the website of the "Arbeitsagentur" (https://www.arbeitsagentur.de/jobsuche/). I used the web developer inspection tool of the Firefox browser to find the element containing the text, e.g. "12.231 Jobs für Informatiker/in". My code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait

content = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker/in"
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(executable_path="C:/Drivers/geckodriver/geckodriver.exe", options=options)
driver.get(content)
soup = BeautifulSoup(driver.page_source, 'html.parser')
num_jobs = soup.select_one('h2#suchergebnis-h1-anzeige')  # the heading that displays the job count
print(num_jobs)
driver.quit()  # quit() also ends the geckodriver process

As a result I get the correct line extracted, but it does not include the queried information. Translated into English, I get this output:

<h2 _ngcontent-serverapp-c39="" id="suchergebnis-h1-anzeige">Jobs for Informatiker/in are loaded</h2>

In the web-inspector of firefox I see instead:

<h2 id="suchergebnis-h1-anzeige"  _ngcontent-serverapp-c39="">
12.231 Jobs für Informatiker/in</h2>

I tried the WebDriverWait method and driver.implicitly_wait() to wait until the webpage has loaded completely, but without success. Probably this value is calculated and inserted by a JavaScript script(?). As I am not a web developer, I don't know how this works or how to extract the line with the given number of jobs correctly. I tried to use the debugger of the Firefox developer tools to see where/how the value is calculated, but most scripts are only very cryptic one-liners.
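
The wait attempt looked roughly like this (a sketch; the exact condition I used may have differed):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
# presumably this returns immediately: the h2 is already present with the
# placeholder text before the script fills in the actual number
wait.until(EC.presence_of_element_located((By.ID, "suchergebnis-h1-anzeige")))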

(Extracting the number out of the text by means of a regular expression will be no problem at all.)
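
For example (a minimal sketch):

import re

text = "12.231 Jobs für Informatiker/in"
match = re.search(r"[\d.]+", text)
if match:
    # drop the German thousands separator: "12.231" -> 12231
    print(int(match.group().replace(".", "")))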

I would really appreciate your support or any useful hints.

CodePudding user response:

Since the contents are loaded dynamically, you can parse the number of job results only after a certain element is visible; at that point, all elements will have been loaded and you can successfully parse your desired data.

You can also increase the sleep time to load all the data, but that's a brittle solution.

Working code -

import time

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()

# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")

# webdriver_manager downloads a matching chromedriver automatically
chrome_driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)


def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker/in"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        
    # time.sleep(10)  # a fixed sleep gives everything time to load, but is not an advised solution
       
    # wait until the results container is visible; by then the job count has been rendered
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))
        
    # absolute XPath to the heading that contains the job count; brittle,
    # but it matches the current page structure
    elem = driver.find_element(By.XPATH,
                               '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)


arbeitsagentur_scraper()

Output -

12.165 Jobs für Informatiker/in

CodePudding user response:

Alternatively, you can use their API URL to load the results. For example:

import json
import requests


api_url = "https://rest.arbeitsagentur.de/jobboerse/jobsuche-service/pc/v4/jobs"

query = {
    "angebotsart": "1",
    "was": "Informatiker/in",
    "page": "1",
    "size": "25",
    "pav": "false",
}

headers = {
    # OAuth token as used by the site's own frontend (it can be copied from
    # the browser's network tab); it is short-lived and has to be refreshed
    "OAuthAccessToken": "eyJhbGciOiJIUzUxMiJ9.eyAic3ViIjogIklkNFZSNmJoZFpKSjgwQ2VsbHk4MHI4YWpkMD0iLCAiaXNzIjogIk9BRyIsICJpYXQiOiAxNjU0MDM2ODQ1LCAiZXhwIjogMS42NTQwNDA0NDVFOSwgImF1ZCI6IFsgIk9BRyIgXSwgIm9hdXRoLnNjb3BlcyI6ICJhcG9rX21ldGFzdWdnZXN0LCBqb2Jib2Vyc2Vfc3VnZ2VzdC1zZXJ2aWNlLCBhYXMsIGpvYmJvZXJzZV9rYXRhbG9nZS1zZXJ2aWNlLCBqb2Jib2Vyc2Vfam9ic3VjaGUtc2VydmljZSwgaGVhZGVyZm9vdGVyX2hmLCBhcG9rX2hmLCBqb2Jib2Vyc2VfcHJvZmlsLXNlcnZpY2UiLCAib2F1dGguY2xpZW50X2lkIjogImRjZGVhY2JkLTJiNjItNDI2MS1hMWZhLWQ3MjAyYjU3OTg0OCIgfQ.BBkJbJ93fGqQQQGX4-VTzX8P6Twg8Rthq8meXV2WY_CoUmXQWhdgbjkFozP2BJXooSr7yLaTJr7JXEk8hDnCWA",
}

data = requests.get(api_url, params=query, headers=headers).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["maxErgebnisse"])

Prints:

12165
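
Note that this is the same count the page itself displays (12.165, in German number formatting), but returned as a plain number, so no regex extraction is needed.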