How to fix "TypeError: list indices must be integers or slices, not str"?

I'm trying to scrape a website. I want to retrieve a URL from this webpage and use it to reach another page, where I can access the information I need.

import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows 10; Win64; x64; rv:101.0.1) Gecko/20100101 Firefox/101.0.1'}
baseUrl = 'https://elitejobstoday.com/'
url = "https://elitejobstoday.com/"

r = requests.get(url, headers = headers)
c = r.content
soup = BeautifulSoup(c, "lxml")

table = soup.find_all("a",  attrs = {"class": "job-details-link"})

This part works fine; however, the next part is where I get stuck.

def jobScan(link):
     
    the_job = {}

    jobUrl = '{}{}'.format(baseUrl, link['href'])
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    print(the_job)

    return the_job

jobScan(table)

I'm getting this error:

File "C:\Users\MUHUMUZA IVAN\Desktop\JobPortal\absa.py", line 41, in jobScan
    jobUrl = '{}{}'.format(baseUrl, link['href'])
TypeError: list indices must be integers or slices, not str

I'm clearly doing something wrong, but I can't see it. I need your help. Thanks.

CodePudding user response：

There are two main issues:

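You are not iterating over the ResultSet of links. You pass table, which is a list of <a> tags, to your function, so link['href'] indexes a list with a string and raises the TypeError.
Your URLs become invalid when you prepend baseUrl, because the href values are already absolute. Just use jobUrl = link['href'].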
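
Both issues can be reproduced without any network access. The following is a minimal sketch, assuming plain dictionaries as stand-ins for the <a> tags in table; the relative path is hypothetical and only there to show how urljoin from the standard library treats absolute and relative links:

```python
from urllib.parse import urljoin

base_url = "https://elitejobstoday.com/"

# Stand-in for the ResultSet returned by find_all(): a list of link-like items.
links = [
    {"href": "https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/"},
    {"href": "/jobs/example-relative-path/"},  # hypothetical relative link
]

# Issue 1: indexing the list itself with a string reproduces the error.
try:
    links["href"]
except TypeError as exc:
    error_message = str(exc)

# Issue 2: plain concatenation duplicates the domain for absolute hrefs.
broken_url = "{}{}".format(base_url, links[0]["href"])

# Iterating yields one item at a time; urljoin leaves absolute URLs
# untouched and resolves relative ones against base_url.
job_urls = [urljoin(base_url, link["href"]) for link in links]

print(error_message)
print(broken_url)
print(job_urls)
```

If every href on the page is absolute, link['href'] alone is enough; urljoin is only a defensive option in case relative links ever appear.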

Note: You should also check whether the elements you are looking for exist in the responses.
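
One way to do such a check is a small helper that tolerates a missing element. This is only a sketch and not part of the code in this thread: text_or_none is a hypothetical name, and SimpleNamespace stands in for a parsed tag so the snippet runs without fetching a page:

```python
from types import SimpleNamespace


def text_or_none(element):
    """Return the element's stripped text, or None if the lookup found nothing."""
    if element is None:
        return None
    return element.text.strip()


# Stand-ins for the results of jobSoup.find(...): one hit and one miss.
found = SimpleNamespace(text=" World Vision Uganda\n")
missing = None

company = text_or_none(found)
title = text_or_none(missing)
print(company, title)
```

Inside jobScan, a helper like this could replace direct access such as name.a.text, which raises AttributeError when find() returns None.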

Example

This iterates over the first two URLs only. The third will raise an error because there is no <h3> in its response, but that should be asked as a new question with exactly that focus:

def jobScan(link):
     
    the_job = {}

    jobUrl = link['href']
    print(jobUrl)
    the_job['urlLink'] = jobUrl

    job = requests.get(jobUrl, headers = headers )
    jobC = job.content
    jobSoup = BeautifulSoup(jobC, "lxml")

    name = jobSoup.find("h3", attrs={"class": "loop-item-title"})
    title = name.a.text
    the_job['title'] = title

    company = jobSoup.find_all("span", {"class": "job-company"})[0]
    company = company.text
    the_job['company'] = company

    return the_job

data = []

for a in table[:2]:
    data.append(jobScan(a))

data

Output

[{'urlLink': 'https://elitejobstoday.com/jobs/office-assistant-ngo-careers-at-world-vision-uganda/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' World Vision Uganda\n'},
 {'urlLink': 'https://elitejobstoday.com/jobs/survey-enumerators-41-positions-ngo-careers-at-catholic-relief-services-2022/',
  'title': 'Project Accountant – Lego Foundation Playful Parenting Project (NGO Careers) at World Vision Uganda',
  'company': ' Catholic Relief Services (CRS)\n'}]

CodePudding user response：

Maybe it's because table is a list of links, not a single link?