Selenium web scraping of Indeed: Script returns raise TimeoutException


I am working on a script to scrape job postings from Indeed, capturing the title, company, location and job description. The script currently iterates through the first five pages of results and prints a dataframe for each. However, the dataframe for page 2 only includes 3 of the 15 job postings. I think this might be due to the pop-up box that appears asking for your email. To address this, I tried adding a .click() to close the pop-up. Unfortunately, this caused a TimeoutException. I added element = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, "popover-x-button-close icl-CloseButton"))) hoping it would fix the issue, but no luck so far. Additionally, when I export to CSV, the only page of results that ends up in the CSV is page 5. I've included my code below. My apologies if these are very straightforward problems; I only started learning Python for this project three days ago. Thank you in advance!

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []



    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:

            jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
            jobtitles.append(jobtitle)
            company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
            companies.append(company)
            location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
            locations.append(location)
            description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
            descriptions.append(description)
            element = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, "popover-x-button-close icl-CloseButton")))
            close_popup = driver.find_element_by_class_name("popover-x-button-close icl-CloseButton")
            close_popup.click()



    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')

CodePudding user response:

There are several issues here:

  1. The pop-up appears only once (on the second page), but you wait for it on every loop iteration. You should check whether the element appears and click it only if it did; otherwise just pass.
  2. The element has several class name attributes, so you should locate it with a CSS selector or XPath, not by_class_name, since that method accepts a single class name, not a space-separated list of class names.
  3. You can call click() directly on the element returned by WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))); there is no need to fetch the element again with driver.find_element.

I suggest something like the following:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []



    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:

            jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
            jobtitles.append(jobtitle)
            company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
            companies.append(company)
            location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
            locations.append(location)
            description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
            descriptions.append(description)
            try:
                # dismiss the email pop-up if it appears; if it never shows, the wait
                # times out and we simply move on to the next job card
                WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
            except TimeoutException:
                pass



    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
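
On the CSV point from the question: to_csv is called inside the page loop, so each page overwrites the previous file and only page 5 survives. Below is a minimal sketch of one way to keep all pages, collecting the rows across iterations and writing a single file after the loop. It assumes the same imports and driver setup as above; the variable names (all_rows, df_all) and the output path are just examples.

all_rows = []

for i in range(0, 50, 10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    for job in driver.find_elements_by_class_name("slider_container"):
        # collect one row per job card instead of separate per-page lists
        all_rows.append({
            'JobTitle': job.find_element_by_class_name('jobTitle').text.replace("new", "").strip(),
            'Company': job.find_element_by_class_name('companyName').text.replace("new", "").strip(),
            'Location': job.find_element_by_class_name('companyLocation').text.replace("new", "").strip(),
            'Description': job.find_element_by_class_name('job-snippet').text.replace("new", "").strip(),
        })
        try:
            # dismiss the email pop-up if it shows up; ignore the wait timeout otherwise
            WebDriverWait(driver, 5).until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
        except TimeoutException:
            pass

# one DataFrame with all five pages, written once
df_all = pd.DataFrame(all_rows)
print(df_all)
df_all.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv', index=False)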