Python Webscraping looping pages

Time:02-12

I recently started my very first Data Science project. I want to analyze specific job offers and therefore need to gather some data from a job portal.

Unfortunately, I am already stuck at the very beginning. I seem to have trouble looping through pages. I know there are already similar questions, but none of the answers seem to help me (or maybe I simply do not understand them).

When scraping a single page, I get exactly the result I am looking for, e.g.:

       Firma: Greiner AG , Job:  Controller (m/w/d)  , Arbeitsort: Sattledt , Online seit 8.2.2022

But as soon as I try to loop through pages, I get an error message:

Traceback (most recent call last):
  File "e:\Programmieren\Projects\Webscraping\laola1_scraper.py", line 18, in <module>
    job_title = jobs.find('h2', class_ = 'm-jobsListItem__title').text
AttributeError: 'NoneType' object has no attribute 'text'

I also tried starting with page 2. In that case I get 5 lines of results as intended, and then the error message appears again.

I checked the spot on the website where my code breaks, but there is definitely no change in structure like the one mentioned in other cases.

I have been sitting here for almost 3 hours now but can't find a solution. I guess it's pretty simple, but what am I missing?

import requests
from bs4 import BeautifulSoup as bs

url_job = "https://www.karriere.at/jobs/controller-controlling/oberösterreich-zentralraum"

#response = requests.get(url_job)

for page in range(2,10):

    response = requests.get(url_job + "?page=" + str(page))
    data = bs(response.content, 'lxml')

    job = data.find_all('li', class_ = 'm-jobsList__item')

    for jobs in job:
        job_title = jobs.find('h2', class_ = 'm-jobsListItem__title').text
        job_company = jobs.find('div', class_ ='m-jobsListItem__company').text
        job_location = jobs.find('li', class_ ='m-jobsListItem__location').text
        job_date = jobs.find('span', class_ ='m-jobsListItem__date').text.replace("am","")

        print(f'''
        Firma:{job_company}, Job:{job_title}, Arbeitsort:{job_location}, Online seit{job_date}
        ''')

Thanks in advance

best, bones

CodePudding user response:

Your code is almost OK, but some list items (e.g. ads) don't contain a job offer, so find() returns None for them and .text raises the AttributeError. Skip those items:

import requests
from bs4 import BeautifulSoup as bs

url_job = "https://www.karriere.at/jobs/controller-controlling/oberösterreich-zentralraum"

for page in range(10):
    response = requests.get(url_job + "?page=" + str(page))
    data = bs(response.content, "lxml")

    job = data.find_all("li", class_="m-jobsList__item")

    for jobs in job:
        # skip specific classes:
        if jobs.select_one(".m-brandingSolutionAdCard, .m-alarmDisruptor"):
            continue

        job_title = jobs.find("h2", class_="m-jobsListItem__title").text
        job_company = jobs.find("div", class_="m-jobsListItem__company").text
        job_location = jobs.find("li", class_="m-jobsListItem__location").text
        job_date = jobs.find(
            "span", class_="m-jobsListItem__date"
        ).text.replace("am", "")

        print(
            f"""Firma:{job_company}, Job:{job_title}, Arbeitsort:{job_location}, Online seit{job_date}"""
        )

Prints:

Firma: Oberbank AG , Job:  MitarbeiterIn Kostenmanagement (m/w/d)  , Arbeitsort: Linz , Online seit 5.2.2022
Firma: TOURISMUSVERBAND Region WELS , Job:  MitarbeiterIn Buchhaltung und Kostenrechnung (Teilzeit bis 20 h)  , Arbeitsort: Wels , Online seit 8.2.2022
Firma: Oberbank AG , Job:  Junior Controller (m/w/d) - Karrierechance für BerufseinsteigerInnen  , Arbeitsort: Linz , Online seit 8.2.2022
Firma: WIFI Oberösterreich , Job:  Controlling/Kostenrechnung  , Arbeitsort: Linz , Online seit 7.2.2022
Firma: Schlüsselbauer Technology GmbH & Co KG , Job:  Kostenrechner - Controller (m/w/d)  , Arbeitsort: Gaspoltshofen , Online seit 1.2.2022
Firma: ISG Personalmanagement GmbH , Job:  Controller - Schwerpunkt HR (m/w/d)  , Arbeitsort: Linz , Online seit 9.2.2022
Firma: ISG Personalmanagement GmbH , Job:  Financial Controller (m/w/d)  , Arbeitsort: Linz , Online seit 5.2.2022
Firma: Schulmeister Finance , Job:  (Junior) Controller mit ausgezeichneten Entwicklungsmöglichkeiten (m/w/d)  , Arbeitsort: Wels , Online seit 9.2.2022
Firma: Schulmeister Finance , Job:  Controller (m/w/d) für Non Profit Organisation  , Arbeitsort: Linz , Online seit 2.2.2022
Firma: VACE Engineering GmbH , Job:  Financial Controller (m/w/d)  , Arbeitsort: Linz , Online seit 10.2.2022
Firma: Schulmeister Finance , Job:  (Senior) Controller (m/w/d)  , Arbeitsort: Linz , Online seit 10.2.2022
Firma: ÖSWAG Maschinenbau GmbH , Job:  Controller / Bilanzbuchhalter (m/w/d)  , Arbeitsort: Linz , Online seit 10.2.2022
Firma: Maschinenring Personal und Service eGen , Job:  (Senior-)Controller/in (m/w/d)  , Arbeitsort: Linz , Online seit 10.2.2022
Firma: Schulmeister Finance , Job:  Junior-Controller (m/w/d)  , Arbeitsort: Linz , Online seit 9.2.2022
Firma: Schulmeister Finance , Job:  Senior Controller (m/w/d) für innovatives Geschäftsfeld  , Arbeitsort: Linz , Online seit 9.2.2022
Firma: Schulmeister Finance , Job:  Serviceorientierter Controller mit Hands-On-Mentalität (m/w/d)  , Arbeitsort: Linz , Online seit 9.2.2022
Firma: TGW Logistics Group , Job:  Group Controller (m/w/d)  , Arbeitsort: Marchtrenk , Online seit 9.2.2022

...and so on.
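
As an alternative to listing the ad classes by name, the loop could check each find() result for None before reading its text. The following is only a minimal sketch under that assumption: the names FIELDS and extract_job are hypothetical helpers, not part of the original code, and the class names are simply the ones used above (the site's markup may have changed since).

```python
# Hypothetical helper: tag name and CSS class for each field, taken from
# the selectors used in the code above.
FIELDS = {
    "title": ("h2", "m-jobsListItem__title"),
    "company": ("div", "m-jobsListItem__company"),
    "location": ("li", "m-jobsListItem__location"),
    "date": ("span", "m-jobsListItem__date"),
}


def extract_job(item):
    """Return a dict of field texts, or None if any field is missing.

    `item` is expected to be a BeautifulSoup Tag, e.g. one element of
    data.find_all("li", class_="m-jobsList__item").
    """
    job = {}
    for field, (tag, css_class) in FIELDS.items():
        node = item.find(tag, class_=css_class)
        if node is None:
            # find() returns None when nothing matches (e.g. an ad card),
            # so report "not a job offer" instead of raising AttributeError.
            return None
        job[field] = node.get_text(strip=True)
    # Mirror the original code, which removes the leading "am" from the date.
    job["date"] = job["date"].replace("am", "").strip()
    return job
```

Inside the inner loop this would be used as job_info = extract_job(jobs), followed by continue whenever job_info is None. The trade-off is that any item with an unexpected structure is skipped silently, so a layout change on the site would show up as missing rows rather than as an error.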