Indeed Webscrape (Selenium): Script only returning one page of data frame into CSV/Long Run Time


I'm currently learning Python for web scraping and have run into an issue with my current script. After closing the pop-up on page 2 of Indeed and cycling through the pages, the script only writes one page of the data frame to CSV, even though it prints every page in my terminal. It also occasionally returns only part of a page's data: for example, page 2 will show info for the first 3 jobs in my print(df_da) output, but nothing for the next 12. Additionally, the script takes a very long time to run, averaging around 6 minutes 45 seconds for the 5 pages (roughly 1 to 1.5 minutes per page). Any suggestions? I've attached my script below and can also attach the output of print(df_da) if needed. Thank you in advance!

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []



    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:

        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)
        try:
            WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
        except:
            pass



    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')

CodePudding user response:

You are defining df_da inside the outer for loop, so df_da will only ever contain the data from the last page.
You should define the lists outside the loops and build the DataFrame only after all the data has been collected.
I suspect you are not getting all the jobs on the second page because of the pop-up, so you should close it before collecting the job details on that page.
You can also drop the pop-up wait from every loop iteration and keep it for the second iteration only, which removes most of the waiting time.
Your code can be something like this:
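To see why the CSV only ends up with one page, here is a minimal pandas-only sketch (no Selenium, with made-up page data) contrasting the two patterns: re-initialising the lists and DataFrame inside the loop versus accumulating across all pages and building the DataFrame once at the end.

```python
import pandas as pd

# Stand-in for the scraped results: three "pages" of job titles.
pages = [["job A", "job B"], ["job C", "job D"], ["job E"]]

# Pattern from the question: the list and DataFrame are rebuilt
# every iteration, so only the last page survives.
for page in pages:
    titles = []                      # re-initialised each page
    titles.extend(page)
    df_wrong = pd.DataFrame({"JobTitle": titles})

# Suggested pattern: accumulate across all pages, build once.
titles = []
for page in pages:
    titles.extend(page)
df_right = pd.DataFrame({"JobTitle": titles})

print(len(df_wrong), len(df_right))  # 1 5
```

The same applies to writing the CSV: `to_csv` inside the loop overwrites the file each pass, so moving both the DataFrame construction and the write after the loop fixes it.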

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

jobtitles = []
companies = []
locations = []
descriptions = []

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical engineer&l=united states&start=' + str(i))
    driver.implicitly_wait(5)

    jobs = driver.find_elements_by_class_name("slider_container")

    for idx, job in enumerate(jobs):
        if idx == 1:
            try:
                WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
            except:
                pass

        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

df_da=pd.DataFrame()
df_da['JobTitle']=jobtitles
df_da['Company']=companies
df_da['Location']=locations
df_da['Description']=descriptions
print(df_da)
df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
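This also explains most of the long run time. In the original script, the 5-second WebDriverWait runs inside the inner job loop, and when the pop-up is absent it blocks for the full timeout on every job. A rough back-of-the-envelope calculation (assuming around 15 job cards per Indeed results page, which is an estimate, not something from your output):

```python
jobs_per_page = 15   # assumed typical count of job cards per results page
pages = 5
popup_timeout = 5    # seconds WebDriverWait blocks when the pop-up never appears

# Original script: wait attempted once per job, on every page.
worst_case_wait = jobs_per_page * pages * popup_timeout
print(worst_case_wait)   # 375 seconds of pure waiting, ~6 min 15 s

# Revised script: wait attempted once per page.
per_page_once = pages * popup_timeout
print(per_page_once)     # 25 seconds at most
```

375 seconds lines up closely with the ~6 minutes 45 seconds you measured, so moving the wait out of the inner loop should bring the run time down to well under a minute.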