I'm trying to scrape a website. I want to retrieve a URL from this page and follow it to another page where I can access the information I need.
import requests
from bs4 import BeautifulSoup
headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
baseUrl = 'https://elitejobstoday.com/'
url = "https://elitejobstoday.com/"
r = requests.get(url, headers = headers)
c = r.content
soup = BeautifulSoup(c, "lxml")
table = soup.find_all("a", attrs = {"class": "job-details-link"})
This part works fine. The next part is where I get stuck.
def jobScan(link):
    the_job = {}
    jobUrl = '{}{}'.format(baseUrl, link['href'])
    the_job['urlLink'] = jobUrl
    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title
    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company
    print(the_job)
    return the_job
jobScan(table)
I'm getting this error:
"File "C:\Users\MUHUMUZA IVAN\Desktop\JobPortal\absa.py", line 41, in jobScan
jobUrl = '{}{}'.format(baseUrl, link['href'])
TypeError: list indices must be integers or slices, not str "
I'm clearly doing something wrong, but I can't see it. I need your help. Thanks.
CodePudding user response:
There are two main issues:
You are not iterating over the ResultSet of URLs. You pass table, the whole list of links, to your function instead of a single link.
Your URLs become invalid when you prepend baseUrl. Just use jobUrl = link['href'], because the href is already an absolute URL.
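To see why prepending baseUrl breaks the links, here is a minimal sketch comparing plain string concatenation with urljoin from the standard library. urljoin is not used in the question; it is only suggested here as one way to cope with both absolute and relative href values, and the relative path below is a made-up example.

```python
from urllib.parse import urljoin

base_url = "https://elitejobstoday.com/"
# An absolute href of the kind this site returns in its job links.
absolute_href = "https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/"
# A hypothetical relative href, included for comparison only.
relative_href = "jobs/some-job/"

# Plain concatenation duplicates the scheme and host for an absolute href.
broken = "{}{}".format(base_url, absolute_href)
print(broken)

# urljoin leaves an absolute href unchanged and resolves a relative one.
print(urljoin(base_url, absolute_href))
print(urljoin(base_url, relative_href))
```

If the site only ever emits absolute links, using link['href'] directly is enough.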
Note: You should also check whether the elements you are looking for exist in the response.
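One possible way to do that check is sketched below. The helper name extract_job_fields is hypothetical, and job_soup is assumed to be a BeautifulSoup object like jobSoup in the question. The selectors are copied from the question and are not verified against every page on the site.

```python
def extract_job_fields(job_soup):
    """Return title and company from a parsed job page.

    Missing elements yield None instead of raising AttributeError.
    job_soup is assumed to be a BeautifulSoup object (hypothetical helper).
    """
    fields = {"title": None, "company": None}

    # find() returns None when nothing matches, so check before using .a or .text.
    heading = job_soup.find("h3", attrs={"class": "loop-item-title"})
    if heading is not None and heading.a is not None:
        fields["title"] = heading.a.text.strip()

    company = job_soup.find("span", attrs={"class": "job-company"})
    if company is not None:
        fields["company"] = company.text.strip()

    return fields
```

Inside jobScan this could be called as the_job.update(extract_job_fields(jobSoup)), so a page without the expected <h3> produces a None title rather than an exception.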
Example
This iterates over the first two URLs only. The third will raise an error because there is no <h3> in the response, but that should be asked as a new question with exactly that focus:
def jobScan(link):
    the_job = {}
    # The href is already an absolute URL, so use it as-is
    jobUrl = link['href']
    print(jobUrl)
    the_job['urlLink'] = jobUrl
    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")
    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title
    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company
    return the_job
data = []
# Call jobScan once per link instead of passing the whole ResultSet
for a in table[:2]:
    data.append(jobScan(a))
data
Output
[{'urlLink': 'https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/',
'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
'company': ' World Vision Uganda\n'},
{'urlLink': 'https://elitejobstoday.com/jobs/survey-enumerators-41-positions-ngo-careers-at-catholic-relief-services-2022/',
'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
'company': ' Catholic Relief Services (CRS)\n'}]
CodePudding user response:
Maybe because table is a list of links, not a single link?
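A minimal sketch of this point, using plain dicts as stand-ins for the tags that find_all returns (an assumption made so the snippet runs without a network request, with made-up URLs). Indexing the list itself with a string raises the same TypeError as in the question, while indexing each item works.

```python
# Stand-ins for the <a> tags in `table`; real items would be bs4 Tag objects.
table = [
    {"href": "https://elitejobstoday.com/jobs/first-job/"},   # hypothetical URL
    {"href": "https://elitejobstoday.com/jobs/second-job/"},  # hypothetical URL
]

# Indexing the whole list with a string fails, as in the question.
try:
    table["href"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Indexing each item in turn works.
hrefs = [link["href"] for link in table]
print(hrefs)
```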