Selenium - How can it be used to retrieve data from indeed.com despite the strange behaviour when th

Time:06-16

When using Selenium to retrieve data from indeed.com, I have noticed a strange behaviour that I still cannot explain.

Introduction

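The URLs have the following format, where {x} is the offset of the first result (taken from the code below):

https://it.indeed.com/jobs?q=Call Center&sort=date&start={x}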

You can browse different pages of results.
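
As a rough sketch, the URL pattern from the code in this question can be built with the standard library, so that the space in "Call Center" is encoded instead of being left raw in the string. The helper names build_search_url and page_offsets are made up for illustration, and the step of 10 is only an assumption carried over from the question's loop.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "https://it.indeed.com/jobs"  # host and path taken from the question's code


def build_search_url(query: str, start: int, sort: str = "date") -> str:
    """Return a results-page URL with the query string percent-encoded."""
    params = {"q": query, "sort": sort, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"


def page_offsets(pages: int, step: int = 10, first: int = 0) -> List[int]:
    """Return the `start` value for each of `pages` consecutive result pages."""
    return [first + n * step for n in range(pages)]


# Hypothetical usage: the first three result pages for the question's query.
urls = [build_search_url("Call Center", start) for start in page_offsets(3)]
for url in urls:
    print(url)
```

Each URL could then be passed to driver.get(...) in place of the lambda.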

When you open a second page (it does not matter where you start from), a popup appears, inviting you to subscribe to the newsletter.

The popup

The strange, unexpected behaviour

Whenever I scrape the second page and the popup appears, I can only scrape a portion of the results.

My first assumption:

  • Maybe part of the results is hidden or not loaded when the popup appears. But looking at the page in Chrome, I could confirm that it contains all of the data I need.

Second assumption:

  • The page needs more time to load. But increasing the waiting time did not solve the issue.

I do not understand what is happening. Could you help?

My Code

# Get the driver
driver = get_driver("YOURPATHTO/chromedriver")
driver.implicitly_wait(5)

url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call Center&sort=date&start={x}"

list_jobs = []

# Let's get the first 100 pages (start = 0, 10, ..., 990)
for i in range(0, 1000, 10):
    current_jobs = []
 
    # Get the page
    driver.get(url_indeed(i))

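    # NOTE: `jobs` is never assigned in this snippet; it is presumably the list of job-card elements found on the page
    for counter in range(len(jobs)):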
        job = jobs[counter]
        dictio = {}
        print("___ ___ ___")
        print(job.text) # Debug
        search1 = job.find_elements_by_xpath(".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements_by_xpath(".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements_by_xpath(".//span[@class='companyName']")
        search4 = job.find_elements_by_xpath(".//div[@class='companyLocation']")

        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text

        if dictio["company"] == '':
            print(":(") # Debug
            pass
        
        current_jobs.append(dictio)

    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)

My output (at the second iteration of the loop)

current_jobs

The expected output...

There should be no results missing like this. It is almost like there is the expected HTML but with no text inside of it.

No jobs should be missing

CodePudding user response:

Selenium interacts with the page the way a user does, so the popup should be closed before the results are read.

This is how it's handled:

    # On the second page, i.e. when i has increased by one increment (i == pageSteps),
    # close the popup. This only needs to be done once.
    if i==pageSteps:
        closeButtonXpath = "//div[@id='popover-x']/button"
        wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
        wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))

It only needs to run once, on the second page (when i has been incremented by pageSteps).
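
The check i == pageSteps assumes the loop starts at offset 0, while the question says the popup shows up on the second page opened, wherever you start. A minimal sketch of an alternative, under that assumption, is to attempt the close on every page until it has succeeded once. The class PopupGuard and its method names are hypothetical; the Selenium-specific part is passed in as a callable so the bookkeeping can be read and tried on its own.

```python
from typing import Callable, Tuple, Type


class PopupGuard:
    """Attempts to close a popup on each page until it has been closed once."""

    def __init__(
        self,
        close_popup: Callable[[], None],
        not_present: Tuple[Type[BaseException], ...],
    ) -> None:
        # close_popup: zero-argument callable that clicks the close button.
        # not_present: exception types meaning "no popup on this page".
        self._close_popup = close_popup
        self._not_present = not_present
        self.dismissed = False

    def ensure_closed(self) -> bool:
        """Return True only if the popup was closed during this call."""
        if self.dismissed:
            return False  # already handled on an earlier page
        try:
            self._close_popup()
        except self._not_present:
            return False  # popup not shown yet; try again on the next page
        self.dismissed = True
        return True
```

With Selenium, close_popup could wrap wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click() and not_present could be (TimeoutException,) from selenium.common.exceptions. That wiring is an untested suggestion, and a shorter explicit wait would probably be wanted so that pages without the popup are not delayed by the full timeout.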

#######################################

This is everything put together:


from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


# Get the driver
driver = webdriver.Chrome()  # note: I changed this to use my own driver
waitTime = 10
driver.implicitly_wait(waitTime)
# The Selenium docs advise against mixing implicit and explicit waits because
# the resulting wait times can be unpredictable; both use the same timeout here.
wait = WebDriverWait(driver, waitTime)

url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call Center&sort=date&start={x}"

list_jobs = []

pageStart = 0 # the starting page
pageRange = 3 # num of pages. 5 == first 5 pages - lowered to debug
pageSteps = 10 # num of results to increment
for i in range(0, (pageRange * pageSteps), pageSteps):
    current_jobs = []
 
    # Get the page
    driver.get(url_indeed(i))


    # On the second page, i.e. when i has increased by one increment (i == pageSteps),
    # close the popup. This only needs to be done once.
    if i==pageSteps:
        closeButtonXpath = "//div[@id='popover-x']/button"
        wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
        wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))

    # I've added this line - i assume this is what you're doing?
    jobs = driver.find_elements(By.XPATH, "//div[@class='job_seen_beacon']")

    for job in jobs:
        dictio = {}
        #print("___ ___ ___")
        #print(job.text) # Debug  - removed to create a clean output for the response
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
        # which is no longer available in newer Selenium 4 releases
        search1 = job.find_elements(By.XPATH, ".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements(By.XPATH, ".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements(By.XPATH, ".//span[@class='companyName']")
        search4 = job.find_elements(By.XPATH, ".//div[@class='companyLocation']")

        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text

        if dictio["company"] == '':
            print(":(") # Debug

        current_jobs.append(dictio)

    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)

This is the output:

[{'extra': 'nuova offerta', 'work': 'Operatori Call Center Inbound e Outbound', 'company': 'PrestitoSì Finance', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatore outbound Smart Working', 'company': 'Elite', 'place': 'Da remoto in 81030 Teverola'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center PART-TIME - 500 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center FULL-TIME - 800 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'ASSISTENZA CLIENTI', 'company': 'Rizdan Job', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'MANAGER DI CALL CENTER FIRENZE', 'company': 'R1S S.r.l.', 'place': 'Firenze Centro, Toscana'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': 'Refcons', 'place': 'Orta di Atella, Campania\n 2 luoghi'}, 
{'extra': 'nuova offerta', 'work': 'TEAM LEADER - RESPONSABILE CALL CENTER', 'company': 'CHRIMAR SRLS', 'place': 'Parma, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Cerchiamo una venditrice di spazi pubblicitari', 'company': 'Lime Edizioni srl Milano', 'place': 'Corbetta, Lombardia'}, {'extra': 'nuova offerta', 'work': 'call center outbound', 'company': 'Jonio Comunicazioni S.r.l.', 'place': "Da remoto in 95131 Sant'Agata li Battiati"}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': '24 MAGGIO TELEFONIA', 'place': '80016 Marano di Napoli'}, {'extra': 'nuova offerta', 'work': 'Operatore Back Office- ROMA SUD', 'company': 'SAGRES SRL', 'place': '00144 Roma'}, {'extra': 'nuova offerta', 'work': 'Apprendista commesso', 'company': 'Tommi srl', 'place': 'Empoli, Toscana\n 2 luoghi'}, {'extra': 'nuova offerta', 'work': 'Architetto', 'company': 'FACILE RISTRUTTURARE S.p.a.', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Addetta/o call center outbound', 'company': 'Fidani S.r.l', 'place': '00142 Roma'}]
15
[{'extra': 'nuova offerta', 'work': 'Operatore telemarketing - No vendita', 'company': 'AVC Utility Services', 'place': '21047 Saronno'}, {'extra': 'nuova offerta', 'work': 'TEAM LEADER CALL CENTER', 'company': 'ServiceHub srls', 'place': '80019 Qualiano\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Capo Cantiere Automazione Industriale (018367)', 'company': 'Hunters Group S.r.l.', 'place': 'Rimini, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO COMMERCIALE VENDITA TELEFONICA OPERATORE OUTBOUND', 'company': 'MEDIAFIVE SRL', 'place': '10141 Torino'}, {'extra': 'nuova offerta', 'work': 'SEGRETARIA/CALL CENTER CATEGORIE PROTETTE', 'company': 'ETICA LAVORO Srl', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER CARE SETTORE TELEMATICO-ASSICURATIVO LING...', 'company': 'Randstad Italia', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico Web sales/Upsales', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore Telefonico Web Sales settore Telecomunicazioni', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico ramo aziende', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Sales Account - Inbound / Outband | Commodities No Food', 'company': 'Page Personnel Italia', 'place': 'Lecco, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatrice di call center', 'company': 'associazione JKT', 'place': '20153 Milano'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO ADDETTO AL TELEMARKETING', 'company': 'LinkLab srl', 'place': 'Trento, Trentino-Alto Adige'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound', 'company': 'Randstad', 'place': 'Da remoto in Rende, Calabria\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'OPERATORE TELEFONICO_INSERIMENTO IMMEDIATO', 'company': 'Mercurycall', 'place': 'Andria, Puglia'}, 
{'extra': 'nuova offerta', 'work': 'Operatore Call Center Part Time', 'company': 'We Can Consulting', 'place': '90135 Palermo'}]
15
[{'extra': 'nuova offerta', 'work': 'OPERATORE ASSISTENZA CLIENTI TELEFONICA - INBOUND', 'company': 'Etjca S.p.a.', 'place': 'Rende, Calabria'}, {'extra': 'nuova offerta', 'work': 'Addetto/a al Customer service', 'company': 'Adecco Italia', 'place': 'Lainate, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Stage Customer Service', 'company': 'Adecco Italia', 'place': 'Segrate, Lombardia\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Receptionist - lingua tedesca', 'company': 'Adecco Italia', 'place': 'Chioggia, Veneto'}, {'extra': 'nuova offerta', 'work': 'Accettatore clienti in officina', 'company': 'Adecco Italia', 'place': 'Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER SERVICE', 'company': 'OpenjobMetis', 'place': 'Catanzaro, Calabria\n 5 luoghi'}, {'extra': 'nuova offerta', 'work': 'Back Office - L.68/99 Lucca', 'company': 'Adecco Italia', 'place': 'Lucca, Toscana\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'addetto/a call center - categoria protetta legge 68/99', 'company': 'Randstad', 'place': 'Calderara di Reno, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatore call center outbound', 'company': 'GIERRE CONTACT', 'place': 'Da remoto in Brindisi, Puglia\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Smartworking Call Center Teleselling', 'company': 'Apophis s.r.l.', 'place': 'Da remoto in Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico da remoto', 'company': 'SGM DISTRIBUZIONE SRL', 'place': 'Da remoto in Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'CALL CENTER MANAGER PROVINCIA DI MILANO', 'company': 'R1S S.r.l.', 'place': 'Cinisello Balsamo, Lombardia'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound e gestione canali digital', 'company': 'Msc srl', 'place': 'Reggio Emilia, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatori Call Center Sede Triggiano e Sede Bari fisso 650', 'company': 'Apophis s.r.l.', 'place': 'Bari, Puglia'}, {'extra': 
'nuova offerta', 'work': 'OPERATORE CALL CENTER PART-TIME', 'company': 'L&C', 'place': 'Roma, Lazio'}]
15
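
The scraping loop reads each field with search[0].text, which raises an IndexError when a card lacks the element and silently stores an empty string when the element has no visible text (the symptom described in the question). As a hedged sketch, the two helpers below make both cases explicit. first_text and split_complete are hypothetical names, and the only assumption about the elements is that they expose a .text attribute, as Selenium's WebElement does.

```python
from typing import Dict, List, Sequence, Tuple


def first_text(elements: Sequence, default: str = "") -> str:
    """Return the stripped .text of the first element, or `default` if missing or empty."""
    if not elements:
        return default
    return (elements[0].text or default).strip()


def split_complete(
    records: List[Dict[str, str]], required: Sequence[str]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split records into (complete, incomplete) by whether every required field is non-empty."""
    complete: List[Dict[str, str]] = []
    incomplete: List[Dict[str, str]] = []
    for record in records:
        target = complete if all(record.get(key) for key in required) else incomplete
        target.append(record)
    return complete, incomplete
```

In the loop this would read dictio["company"] = first_text(search3), and split_complete(current_jobs, ["work", "company"]) would show at a glance how many cards on a page came back without text.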