IndexError when appending output from web scraping within a for loop

Time:09-22

I am trying to web scrape Glassdoor, but I get an IndexError. My code has the following form:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

html = requests.get('https://www.glassdoor.com/Job/germany-data-science-jobs-SRCH_IL.0,7_IN96_KO8,20_IP1.htm?includeNoSalaryJobs=true', timeout=5)

soup = BeautifulSoup(html.content, 'lxml')


# extracts the hyperlinks in each jobposting
link = []
for i in soup.find_all('div', class_ = 'd-flex flex-column pl-sm css-1buaf54 job-search-key-1mn3dn8 e1rrn5ka0'):
    li = 'https://www.glassdoor.com' + i.a['href']
    link.append(li)

# extracts the job descriptions by creating a new soup from each link extracted above
description = []
for links in link:
    page = requests.get(links, headers=headers)
    soup = BeautifulSoup(page.content, 'lxml')
    for job in soup.find_all('div', class_ = 'desc css-58vpdc ecgq1xb5')[0]:
        try:
            description.append(job.text.strip())
        except:
            description.append(None)

I want to extract the job description of every job; the descriptions sit in div or p tags nested inside the ('div', class_ = 'desc css-58vpdc ecgq1xb5') container. When running the code I get the following error:

Traceback (most recent call last):
  File "C:\Users\aedan\PycharmProjects\Data_Science_Job_Openings\main.py", line 47, in <module>
    for job in soup.find_all('div', class_ = 'desc css-58vpdc ecgq1xb5')[0]:
IndexError: list index out of range

Process finished with exit code 1

I used try and except to append None to suppress the error, as shown above, but it didn't work. I also tried html.parser instead of lxml, and the solution from this post: how to fix error in BeautifulSoup IndexError: list index out of range, but I was not able to restructure my code that way.

CodePudding user response:

As mentioned, use find() or select_one() if you want to select only one element, or check that your ResultSet is non-empty before indexing into it.

Also note that find() returns None when nothing matches, so check the availability of the element you searched for and handle that case.
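To see why your loop blows up, compare the two access patterns on a minimal, made-up snippet of HTML (the class names below are just illustrative): find() returns None on a miss, while indexing an empty ResultSet raises the IndexError you saw.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only one of the two classes we query actually exists.
soup = BeautifulSoup('<div class="present">hello</div>', 'html.parser')

# find() on a missing element returns None -- easy to check before use.
missing = soup.find('div', class_='desc css-58vpdc ecgq1xb5')
print(missing)  # None

# find_all() on a missing element returns an empty ResultSet,
# so [0] raises IndexError -- this is what happens in your loop.
try:
    soup.find_all('div', class_='desc css-58vpdc ecgq1xb5')[0]
except IndexError as e:
    print('IndexError:', e)

# Safe pattern: fall back to None when the element is absent.
tag = soup.find('div', class_='present')
text = tag.get_text(strip=True) if tag else None
print(text)  # hello
```

Note that the try/except in your original code cannot catch the IndexError, because the exception is raised by the `[0]` in the for statement itself, before the loop body (and its try block) is ever entered.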

Example

import requests
from bs4 import BeautifulSoup
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
html = requests.get('https://www.glassdoor.com/Job/germany-data-science-jobs-SRCH_IL.0,7_IN96_KO8,20_IP1.htm?includeNoSalaryJobs=true', timeout = 5)
soup = BeautifulSoup(html.content, 'lxml')

data = []

# build absolute job URLs from the search results page
for url in ['https://www.glassdoor.com' + a.get('href') for a in soup.select('li[data-id] a:first-of-type')]:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'lxml')
    # find() returns None if the description container is missing
    desc = soup.find('div', {'id': 'JobDescriptionContainer'})
    data.append({
        'url': url,
        'desc': desc.get_text(strip=True) if desc else None
    })

print(data)