Why am I not able to scrape all pages from a website with BeautifulSoup?

Time:01-25

I'm trying to get the data from all pages. I used a counter, cast it to a string to put the page number in the URL, and looped over that counter, but I always get the same result. This is my code:

# Scraping job offers from the HelloWork website

#import libraries
import random
import requests
import csv
from bs4 import BeautifulSoup
from datetime import date

#configure user agent for mozilla browser

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0",
    "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"
        ]

random_user_agent= random.choice(user_agents)
headers = {'User-Agent': random_user_agent}

Here is where I use my counter:

i=0
for i in range(1,15):
    url = 'https://www.hellowork.com/fr-fr/emploi/recherche.html?p=' + str(i)
    print(url)
    page = requests.get(url,headers=headers)
    if (page.status_code==200):
        soup = BeautifulSoup(page.text,'html.parser')
        jobs = soup.findAll('div',class_=' new action crushed hoverable !tw-p-4 md:!tw-p-6 !tw-rounded-2xl')

        #config csv

        csvfile=open('jobList.csv','w+',newline='')
        row_list=[] #to append list of job

        try :
            writer=csv.writer(csvfile)
            writer.writerow(["ID","Job Title","Company Name","Contract type","Location","Publish time","Extract Date"])
            for job in jobs:
                id = job.get('id')
                jobtitle= job.find('h3',class_='!tw-mb-0').a.get_text()
                companyname = job.find('span',class_='tw-mr-2').get_text()
                contracttype = job.find('span',class_='tw-w-max').get_text()
                location = job.find('span',class_='tw-text-ellipsis tw-whitespace-nowrap tw-block tw-overflow-hidden 2xsOld:tw-max-w-[20ch]').get_text()
                publishtime = job.find('span',class_='md:tw-mt-0 tw-text-xsOld').get_text()
                extractdate = date.today()

                row_list=[[id,jobtitle,companyname,contracttype,location,publishtime,extractdate]]
                writer.writerows(row_list)
        finally:
            csvfile.close()

CodePudding user response:

In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
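
As a small illustration of the two calls, here is a minimal sketch on a made-up HTML fragment. The markup, the class names (offer, card) and the ids are hypothetical and not taken from hellowork.com. It matches each element by a single class, which tends to be more robust than passing a long, exact class string:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page
html = """
<div class="offer card" id="1"><h3><a>Chef</a></h3></div>
<div class="offer card" id="2"><h3><a>Baker</a></h3></div>
<div class="banner">Not a job offer</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(): class_ with one class name matches any element carrying that class
ids = [div.get("id") for div in soup.find_all("div", class_="offer")]

# select(): the same idea written as a CSS selector, drilling down to the link
titles = [a.get_text() for a in soup.select("div.offer h3 a")]

print(ids)
print(titles)
```

Both calls return the two offer elements and skip the banner div.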


BeautifulSoup is not strictly needed here. You could get all of this information, and more, directly from the API using a mix of requests and pandas. Check all available information here:

https://www.hellowork.com/searchoffers/getsearchfacets?p=1

Example

import requests
import pandas as pd
from datetime import datetime
   
df = pd.concat(
    [
        pd.json_normalize(
            requests.get(f'https://www.hellowork.com/searchoffers/getsearchfacets?p={i}', headers={'user-agent':'bond'}).json(), record_path=['Results']
        )[['ContractType','Localisation', 'OfferTitle', 'PublishDate', 'CompanyName']]

        for i in range(1,15)
    ],
    ignore_index=True
)

df['extractdate'] = datetime.today().strftime('%Y-%m-%d')

df.to_csv('jobList.csv', index=False)

Output

    | ContractType | Localisation | OfferTitle | PublishDate | CompanyName | extractdate
0 | CDI | Beaurepaire - 85 | Chef Gérant H/F | 2023-01-24T16:35:15.867 | Armonys Restauration - Morbihan | 2023-01-24
1 | CDI | Saumur - 49 | Dessinateur Métallerie Débutant H/F | 2023-01-24T16:35:14.677 | G2RH | 2023-01-24
2 | Franchise | Villenave-d'Ornon - 33 | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F | 2023-01-24T16:35:13.707 | Elysée Concept | 2023-01-24
3 | Franchise | Montpellier - 34 | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F | 2023-01-24T16:35:12.61 | Elysée Concept | 2023-01-24
4 | CDD | Monaco | Spécialiste Senior Développement Matières Premières Cosmétique H/F | 2023-01-24T16:35:06.64 | Expectra Monaco | 2023-01-24
...
275 | CDI | Brétigny-sur-Orge - 91 | Magasinier - Cariste H/F | 2023-01-24T16:20:16.377 | DELPHARM | 2023-01-24
276 | CDI | Lille - 59 | Technicien Helpdesk Français - Italien H/F | 2023-01-24T16:20:16.01 | Akkodis | 2023-01-24
277 | CDI | Tours - 37 | Conducteur PL H/F | 2023-01-24T16:20:15.197 | Groupe Berto | 2023-01-24
278 | Franchise | Nogent-le-Rotrou - 28 | Courtier en Travaux de l'Habitat pour Particuliers et Professionnels H/F | 2023-01-24T16:20:12.29 | Elysée Concept | 2023-01-24
279 | CDI | Cholet - 49 | Ingénieur Assurance Qualité H/F | 2023-01-24T16:20:10.837 | Akkodis | 2023-01-24
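
A side note on the csv loop in the question: it opens jobList.csv in write mode inside the page loop, and write mode truncates the file each time it is opened, so only the rows of the last page that was written can survive. That may well be part of the "same result" symptom, although it is not confirmed here. A minimal sketch of the alternative, opening the output once and writing every page into it, could look like this. The write_pages helper, the two-column header and the sample rows are hypothetical stand-ins, and an in-memory buffer replaces the real file so the sketch runs without network access:

```python
import csv
import io


def write_pages(pages, csvfile):
    """Write one header row, then the rows of every page, to an already-open file."""
    writer = csv.writer(csvfile)
    writer.writerow(["ID", "Job Title"])
    for rows in pages:          # one iteration per result page
        writer.writerows(rows)  # rows accumulate because the file is opened only once


# Hypothetical stand-in for the rows scraped from two result pages
pages = [
    [["1", "Chef"]],
    [["2", "Baker"]],
]

buffer = io.StringIO()  # in-memory replacement for the real CSV file
write_pages(pages, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```

With a real file, the same idea is to put `with open('jobList.csv', 'w', newline='') as csvfile:` before the `for i in range(1,15):` loop and to write the header row only once.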