I was following along with this video that demonstrates web-scraping indeed.com to get data into a CSV file: https://www.youtube.com/watch?v=eN_3d4JrL_w. When I followed it step by step I got results for a senior accountant in Charlotte, NC. Yet when I ran main again, this time for a data analyst working remotely, it kept giving me the same results. I even tried to build the URL manually, but it still returned the same results. So I took the main bulk of the code and started a new kernel with it, and it threw an error that 'records' was not defined, though it was. I'm new to web scraping and still learning. I'm sure it's something obvious that I just can't see; I would appreciate your help.
My Code:

# Importing the needed libraries
import csv
import requests
from datetime import datetime
from bs4 import BeautifulSoup

# getting the url
def get_url(position, location):
    template = 'https://www.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url

# getting the record
def get_record(card):
    atag1 = card.h2.a.span
    job_title = atag1.get('title')
    atag2 = card.h2.a
    job_url = 'https://indeed.com' + atag2.get('href')
    company = card.find('span', 'companyName').text.strip()
    location = card.find('div', 'companyLocation').text.strip()
    summary = card.find('div', 'job-snippet').text.strip()
    posted_date = card.find('span', 'date').text.strip()
    today = datetime.today().strftime('%Y-%m-%d')
    try:
        salary = card.find('div', 'metadata estimated-salary-container').text.strip()
    except AttributeError:
        salary = ''
    record = (job_title, job_url, location, company, posted_date, today, summary, salary)
    return record

# writing the main function
def main(position, location):
    records = []
    url = get_url(position, location)
    while True:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'job_seen_beacon')
        for card in cards:
            record = get_record(card)
            records.append(record)
        try:
            url = 'https://indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break
    # save the results
    with open('dataanalyst.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['JobTitle', 'JobUrl', 'Location', 'Company',
                        'PostDate', 'ExtractionDate', 'Summary', 'Salary'])
        writer.writerows(records)
CodePudding user response:
You somehow managed to take a relatively simple task (scraping jobs from Indeed) and turn it into something unnecessarily complex, with function calling function. Understandably, you got lost in that complexity and messed up the spacing/indentation (and probably the imports). Maybe that's fine as a learning exercise, or maybe not: something this tangled will get you lost and discourage you from further learning. The idea is to learn to walk before learning to run. Below is the corrected code (I kept all your functions, fixed the indentation, and changed the output filename to something relevant to the job search):
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

# getting the url
def get_url(position, location):
    template = 'https://www.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url

# getting the record
def get_record(card):
    atag1 = card.h2.a.span
    job_title = atag1.get('title')
    atag2 = card.h2.a
    job_url = 'https://indeed.com' + atag2.get('href')
    company = card.find('span', 'companyName').text.strip()
    location = card.find('div', 'companyLocation').text.strip()
    summary = card.find('div', 'job-snippet').text.strip()
    posted_date = card.find('span', 'date').text.strip()
    today = datetime.today().strftime('%Y-%m-%d')
    try:
        salary = card.find('div', 'metadata estimated-salary-container').text.strip()
    except AttributeError:
        salary = ''
    record = (job_title, job_url, location, company, posted_date, today, summary, salary)
    return record

# writing the main function
def main(position, location):
    records = []
    url = get_url(position, location)
    while True:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'job_seen_beacon')
        for card in cards:
            record = get_record(card)
            records.append(record)
        try:
            url = 'https://indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break
    # save the results
    with open(f'{position}-{location}.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['JobTitle', 'JobUrl', 'Location', 'Company',
                        'PostDate', 'ExtractionDate', 'Summary', 'Salary'])
        writer.writerows(records)

main('business manager', 'Geneva')
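One thing this get_url glosses over: multi-word searches like 'data analyst' and terms containing commas need to be URL-encoded before being dropped into the query string, which may be part of why different searches appeared to return the same results. A small sketch of the same helper using the standard library's urllib.parse.urlencode (the q and l parameter names come from the template above):

```python
from urllib.parse import urlencode

def get_url(position, location):
    # urlencode handles spaces, commas, and other special
    # characters in the search terms for us.
    params = urlencode({'q': position, 'l': location})
    return 'https://www.indeed.com/jobs?' + params

url = get_url('data analyst', 'Charlotte, NC')
print(url)  # https://www.indeed.com/jobs?q=data+analyst&l=Charlotte%2C+NC
```

With the naive format() version, the comma and space would go into the URL raw; here they are escaped to %2C and +, so the server sees exactly the terms you typed.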
When you run it, it creates a CSV file named after the job title and location you searched for, 'business manager-Geneva.csv' in this case, with contents like this:
JobTitle | JobUrl | Location | Company | PostDate | ExtractionDate | Summary | Salary
Manager Global Business Coordination | https://indeed.com/rc/clk?jk=6b7b72bad220482b&... | Hybrid remote in Aurora, IL | ALDI | PostedPosted 2 days ago | 2022-07-22 | Identifying opportunities for business improve... | NaN
Operations Business Manager | https://indeed.com/rc/clk?jk=abccb02951be5d36&... | Elgin, IL 60123 | John B. Sanfilippo | PostedPosted 2 days ago | 2022-07-22 | Keeps a close eye on business and economic tre... | Estimated 62.2
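A detail worth double-checking in any variant of this script: the header row must have exactly as many columns as each record tuple (eight here), or every column shifts when you open the CSV. A self-contained sanity check using an in-memory buffer and one made-up record (the values are illustrative, not real scraped data):

```python
import csv
import io

# Columns in the same order get_record() builds its tuple.
header = ['JobTitle', 'JobUrl', 'Location', 'Company',
          'PostDate', 'ExtractionDate', 'Summary', 'Salary']
# One hypothetical record; an empty string stands in for a missing salary.
records = [('Operations Business Manager', 'https://indeed.com/rc/clk?jk=...',
            'Elgin, IL 60123', 'John B. Sanfilippo',
            'Posted 2 days ago', '2022-07-22',
            'Keeps a close eye on business trends.', '')]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(records)

# Read it back and confirm every row is as wide as the header.
buf.seek(0)
rows = list(csv.reader(buf))
assert all(len(row) == len(header) for row in rows)
```

If the assertion fails, a field is missing from either the header list or the record tuple, and tools like pandas will silently misalign the data when reading the file.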