A similar question exists, but I couldn't find an exact answer, so could you please help me?
I copied the following code from the internet to scrape job offers from Indeed. The problem is that the code cannot scrape the job descriptions.
When using: sum_div = job.find_elements_by_class_name('summary')
the code doesn't find 'summary', so it never locates the element that holds the job description, and it is also unable to close the pop-up that appears on Indeed.
I tried another identifier: sum_div = job.find_element_by_class_name('job_seen_beacon')
With that one it does close the pop-up, but it still fails to locate the job description.
Do you have any idea how to solve this, please?
for i in range(0, 50, 10):
    driver.get('https://www.indeed.co.in/jobs?q=artificial intelligence&l=India&start=' + str(i))
    jobs = []
    driver.implicitly_wait(20)
    for job in driver.find_elements_by_class_name('result'):
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        try:
            title = soup.find(class_="jobTitle").text
        except:
            title = 'None'
        try:
            location = soup.find(class_="companyLocation").text
        except:
            location = 'None'
        try:
            company = soup.find(class_="companyName").text.replace("\n", "").strip()
        except:
            company = 'None'
        sum_div = job.find_elements_by_class_name('summary')
        #sum_div = job.find_element_by_class_name('job_seen_beacon')
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')
            close_button.click()
            sum_div.click()
        driver.implicitly_wait(2)
        try:
            job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
            print(job_desc)
        except:
            job_desc = 'None'
        df = df.append({'Title': title, 'Location': location, "Company": company,
                        "Description": job_desc}, ignore_index=True)
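A side note on the `df.append(...)` call in the code above: `DataFrame.append` was deprecated and removed in pandas 2.0, and calling it once per row is also slow. A minimal sketch of the usual alternative, collecting one dict per row in a list and building the frame once after the loop (the field values here are placeholders, not real scraped data):

```python
import pandas as pd

# Collect one dict per scraped job instead of appending to the DataFrame row by row.
rows = []
for i in range(2):  # stand-in for the scraping loop
    rows.append({'Title': f'job {i}', 'Location': 'India',
                 'Company': 'None', 'Description': 'None'})

# Build the DataFrame once, after the loop.
df = pd.DataFrame(rows)
print(df.shape)  # (2, 4)
```

Building the frame once also keeps the column order stable, since every dict uses the same keys.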
CodePudding user response:
The URL isn't dynamic, so there is no need to use Selenium. You can extract the desired data with bs4 and requests. Below is an example.
P.S.: You may not need the try/except blocks, since each page contains exactly 15 items.
from bs4 import BeautifulSoup
import requests
import pandas as pd

jobs = []
for i in range(0, 50, 10):
    url = 'https://www.indeed.co.in/jobs?q=artificial intelligence&l=India&start=' + str(i)
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    for job in soup.select('.result'):
        try:
            title = job.find(class_="jobTitle").text
        except:
            title = 'None'
        try:
            location = job.find(class_="companyLocation").text
        except:
            location = 'None'
        try:
            company = job.find(class_="companyName").text.replace("\n", "").strip()
        except:
            company = 'None'
        try:
            job_desc = job.select_one('div.job-snippet ul').get_text(strip=True)
        except:
            job_desc = 'None'
        jobs.append({'Title': title, 'Location': location, "Company": company, "Description": job_desc})

df = pd.DataFrame(jobs)
print(df)
# to store data
# df.to_csv('data.csv', index=False)
Output:
Title Description
0 newData Scientist: Artificial Intelligence ... As a Data Scientist at IBM, you will help tran...
1 AI and Machine Learning ... A machine learning engineer (ML engineer) focu...
2 newGraduate Intern - Technical ... DPEA enables that data center which is the und...
3 Artificial Intelligence & Machine Learning Expert ... Define and drive projects in AI and Machine Le...
4 newML Data Associate I ... Good familiarity with the Windows desktop envi...
.. ... ...
70 newData Scientist ... Perform data analysis and modelling on data se...
71 AI, Informatics & ML – Research Scientist ... Years of experience 2-4 yrs.Key Responsibiliti...
72 Software Development ... Software Developers at IBM are the backbone of...
73 newB2B/EDI - Map Development Specialist ... Software Developers at IBM are the backbone of...
74 Artificial Intelligence / Data Science/ Machin... ... TATA ELXSI Ltd. is conducting off campus drive...
[75 rows x 4 columns]
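One small caveat about the URL built in the answer: the query contains a literal space ("artificial intelligence"). requests will usually cope, but it is cleaner to let the standard library encode the parameters. A sketch using `urllib.parse.urlencode` (the parameter names match Indeed's query string as used above):

```python
from urllib.parse import urlencode

base = 'https://www.indeed.co.in/jobs'
for i in range(0, 50, 10):
    # urlencode escapes the space in "artificial intelligence" as "+"
    url = base + '?' + urlencode({'q': 'artificial intelligence', 'l': 'India', 'start': i})
    print(url)
```

The first printed URL is https://www.indeed.co.in/jobs?q=artificial+intelligence&l=India&start=0. With requests you can get the same effect by passing `params={'q': ..., 'l': ..., 'start': i}` to `requests.get` instead of building the string yourself.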