While using Selenium to scrape data from indeed.com, I noticed some strange behaviour that I still cannot explain.
Introduction
The URLs have the following format:
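Judging from the script further down, the result pages are addressed as `https://it.indeed.com/jobs?q=Call Center&sort=date&start=N`, where `start` grows by 10 per page. As a minimal sketch (the helper name `build_search_url` is made up for illustration), the same URL can be assembled with `urllib.parse.urlencode`, which also escapes the space in the query:

```python
from urllib.parse import urlencode

BASE_URL = "https://it.indeed.com/jobs"  # host and path taken from the script in this post


def build_search_url(query: str, start: int) -> str:
    """Return the search URL for the results page that begins at offset `start`."""
    params = {"q": query, "sort": "date", "start": start}
    # urlencode escapes the space in the query as "+"
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("Call Center", 0))
print(build_search_url("Call Center", 10))
```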
When you open a second results page (no matter which page you start from), a popup appears inviting you to subscribe to the newsletter.
The strange, unexpected behaviour
Whenever I scrape the second page and the popup appears, I can only scrape part of the results.
My first assumption:
- Maybe part of the results is hidden / not loaded when the popup appears.
But looking at the page in Chrome, I could confirm that it contains all of the data I need.
Second assumption:
- The page needs more time to load.
But increasing the waiting time did not solve the issue.
I do not understand what is happening. Could you help?
My Code
# Get the driver
driver = get_driver("YOURPATHTO/chromedriver")
driver.implicitly_wait(5)
url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call Center&sort=date&start={x}"
list_jobs = []
# Let's get the first 100 pages (start = 0, 10, ..., 990)
for i in range(0, 1000, 10):
    current_jobs = []
    # Get the page
    driver.get(url_indeed(i))
    # NOTE: `jobs` is never assigned in this snippet; it presumably holds the job-card elements of the page
    for counter in range(len(jobs)):
        job = jobs[counter]
        dictio = {}
        print("___ ___ ___")
        print(job.text)  # Debug
        search1 = job.find_elements_by_xpath(".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements_by_xpath(".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements_by_xpath(".//span[@class='companyName']")
        search4 = job.find_elements_by_xpath(".//div[@class='companyLocation']")
        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text
        if dictio["company"] == '':
            print(":(")  # Debug
            pass
        current_jobs.append(dictio)
    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)
My output (at the second iteration of the loop)
The expected output...
No results should be missing like this. It is almost as if the expected HTML is there but with no text inside it.
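One way to test that impression is to compare what Selenium reports as visible text with the text stored in the DOM. `WebElement.text` returns only the text that is rendered as visible, while `get_attribute("textContent")` returns the node's text regardless of visibility. The sketch below is an illustration, not a confirmed diagnosis of Indeed's page: `describe_text` and the `FakeElement` stand-in are hypothetical names, and the stand-in exists only so the example runs without a browser.

```python
def describe_text(element):
    """Classify an element by comparing its visible text with its DOM text.

    `element` is expected to behave like a Selenium WebElement: a `.text`
    attribute and a `get_attribute(name)` method.
    """
    visible = element.text.strip()
    raw = (element.get_attribute("textContent") or "").strip()
    if visible:
        return "visible"
    if raw:
        return "hidden"  # text exists in the DOM but is not rendered as visible
    return "empty"       # no text in the DOM at all


class FakeElement:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


print(describe_text(FakeElement("Randstad", "Randstad")))
print(describe_text(FakeElement("", "Randstad")))
print(describe_text(FakeElement("", "")))
```

In the scraping loop, calling `describe_text(search3[0])` for a job whose company came back empty would tell a hidden element apart from one that really has no text.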
CodePudding user response:
The popup has to be closed before the results are scraped. This is how to handle it:
# On the second page, i.e. when i has increased by the increment.
# This runs the first time the page is incremented
# and only needs to be done once.
if i == pageSteps:
    # Direct click kept for reference only: its XPath is truncated ('//div[@]' is not valid),
    # so the explicit wait below closes the popup instead.
    # driver.find_element(By.XPATH, '//div[@]//button[@aria-label="Close"]').click()
    closeButtonXpath = "//div[@id='popover-x']/button"
    wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
    wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))
It only needs to run once, on the second page (when i has been incremented by pageSteps).
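To make the condition concrete: with the values used in the full script (`pageRange = 3`, `pageSteps = 10`), the loop variable takes the offsets 0, 10 and 20, and `i == pageSteps` holds only for the second one. A small sketch of that arithmetic, with snake_case names chosen here for illustration:

```python
page_range = 3   # number of result pages to visit (same value as pageRange)
page_steps = 10  # results per page, i.e. the step of the `start` parameter

# Offsets passed to the URL, one per page.
offsets = list(range(0, page_range * page_steps, page_steps))

# Offsets at which the popup-closing branch would run.
popup_offsets = [i for i in offsets if i == page_steps]

print(offsets)        # [0, 10, 20]
print(popup_offsets)  # [10]
```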
#######################################
This is everything put together:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# Get the driver
driver = webdriver.Chrome()  # note: I modified this to use my driver
waitTime = 10
driver.implicitly_wait(waitTime)
## The Selenium docs say not to mix implicit and explicit waits, as you CAN get unpredictable wait times,
## but using the same timeout for both is OK here.
wait = WebDriverWait(driver, waitTime)
url_indeed = lambda x: f"https://it.indeed.com/jobs?q=Call Center&sort=date&start={x}"
list_jobs = []
pageStart = 0 # the starting page
pageRange = 3 # num of pages. 5 == first 5 pages - lowered to debug
pageSteps = 10 # num of results to increment
for i in range(0, (pageRange * pageSteps), pageSteps):
    current_jobs = []
    # Get the page
    driver.get(url_indeed(i))
    # On the second page, i.e. when i has increased by the increment.
    # This runs the first time the page is incremented
    # and only needs to be done once.
    if i == pageSteps:
        # Direct click kept for reference only: its XPath is truncated ('//div[@]' is not valid),
        # so the explicit wait below closes the popup instead.
        # driver.find_element(By.XPATH, '//div[@]//button[@aria-label="Close"]').click()
        closeButtonXpath = "//div[@id='popover-x']/button"
        wait.until(EC.element_to_be_clickable((By.XPATH, closeButtonXpath))).click()
        wait.until(EC.invisibility_of_element((By.XPATH, closeButtonXpath)))
    # I've added this line - I assume this is what you're doing?
    jobs = driver.find_elements(By.XPATH, "//div[@class='job_seen_beacon']")
    for counter in range(len(jobs)):
        job = jobs[counter]
        dictio = {}
        #print("___ ___ ___")
        #print(job.text) # Debug - removed to create a clean output for the response
        # find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium 4 releases no longer provide
        search1 = job.find_elements(By.XPATH, ".//div[contains(@class, 'topLeft')]")
        search2 = job.find_elements(By.XPATH, ".//span[contains(@id,'jobTitle')]")
        search3 = job.find_elements(By.XPATH, ".//span[@class='companyName']")
        search4 = job.find_elements(By.XPATH, ".//div[@class='companyLocation']")
        dictio["extra"] = search1[0].text
        dictio["work"] = search2[0].text
        dictio["company"] = search3[0].text
        dictio["place"] = search4[0].text
        if dictio["company"] == '':
            print(":(")  # Debug
            pass
        current_jobs.append(dictio)
    print(current_jobs)
    print(len(current_jobs))
    list_jobs.extend(current_jobs)
This is the output:
[{'extra': 'nuova offerta', 'work': 'Operatori Call Center Inbound e Outbound', 'company': 'PrestitoSì Finance', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatore outbound Smart Working', 'company': 'Elite', 'place': 'Da remoto in 81030 Teverola'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center PART-TIME - 500 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'Operatore Call Center FULL-TIME - 800 euro Mensili', 'company': 'GE.SAR', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'ASSISTENZA CLIENTI', 'company': 'Rizdan Job', 'place': '81100 Caserta'}, {'extra': 'nuova offerta', 'work': 'MANAGER DI CALL CENTER FIRENZE', 'company': 'R1S S.r.l.', 'place': 'Firenze Centro, Toscana'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': 'Refcons', 'place': 'Orta di Atella, Campania\n 2 luoghi'},
{'extra': 'nuova offerta', 'work': 'TEAM LEADER - RESPONSABILE CALL CENTER', 'company': 'CHRIMAR SRLS', 'place': 'Parma, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Cerchiamo una venditrice di spazi pubblicitari', 'company': 'Lime Edizioni srl Milano', 'place': 'Corbetta, Lombardia'}, {'extra': 'nuova offerta', 'work': 'call center outbound', 'company': 'Jonio Comunicazioni S.r.l.', 'place': "Da remoto in 95131 Sant'Agata li Battiati"}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico', 'company': '24 MAGGIO TELEFONIA', 'place': '80016 Marano di Napoli'}, {'extra': 'nuova offerta', 'work': 'Operatore Back Office- ROMA SUD', 'company': 'SAGRES SRL', 'place': '00144 Roma'}, {'extra': 'nuova offerta', 'work': 'Apprendista commesso', 'company': 'Tommi srl', 'place': 'Empoli, Toscana\n 2 luoghi'}, {'extra': 'nuova offerta', 'work': 'Architetto', 'company': 'FACILE RISTRUTTURARE S.p.a.', 'place': 'Milano, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Addetta/o call center outbound', 'company': 'Fidani S.r.l', 'place': '00142 Roma'}]
15
[{'extra': 'nuova offerta', 'work': 'Operatore telemarketing - No vendita', 'company': 'AVC Utility Services', 'place': '21047 Saronno'}, {'extra': 'nuova offerta', 'work': 'TEAM LEADER CALL CENTER', 'company': 'ServiceHub srls', 'place': '80019 Qualiano\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Capo Cantiere Automazione Industriale (018367)', 'company': 'Hunters Group S.r.l.', 'place': 'Rimini, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO COMMERCIALE VENDITA TELEFONICA OPERATORE OUTBOUND', 'company': 'MEDIAFIVE SRL', 'place': '10141 Torino'}, {'extra': 'nuova offerta', 'work': 'SEGRETARIA/CALL CENTER CATEGORIE PROTETTE', 'company': 'ETICA LAVORO Srl', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER CARE SETTORE TELEMATICO-ASSICURATIVO LING...', 'company': 'Randstad Italia', 'place': 'Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico Web sales/Upsales', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore Telefonico Web Sales settore Telecomunicazioni', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico ramo aziende', 'company': 'H2Com', 'place': 'Da remoto in 00164 Roma'}, {'extra': 'nuova offerta', 'work': 'Sales Account - Inbound / Outband | Commodities No Food', 'company': 'Page Personnel Italia', 'place': 'Lecco, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Operatrice di call center', 'company': 'associazione JKT', 'place': '20153 Milano'}, {'extra': 'nuova offerta', 'work': 'IMPIEGATO ADDETTO AL TELEMARKETING', 'company': 'LinkLab srl', 'place': 'Trento, Trentino-Alto Adige'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound', 'company': 'Randstad', 'place': 'Da remoto in Rende, Calabria\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'OPERATORE TELEFONICO_INSERIMENTO IMMEDIATO', 'company': 'Mercurycall', 'place': 'Andria, Puglia'}, {'extra': 'nuova 
offerta', 'work': 'Operatore Call Center Part Time', 'company': 'We Can Consulting', 'place': '90135 Palermo'}]
15
[{'extra': 'nuova offerta', 'work': 'OPERATORE ASSISTENZA CLIENTI TELEFONICA - INBOUND', 'company': 'Etjca S.p.a.', 'place': 'Rende, Calabria'}, {'extra': 'nuova offerta', 'work': 'Addetto/a al Customer service', 'company': 'Adecco Italia', 'place': 'Lainate, Lombardia'}, {'extra': 'nuova offerta', 'work': 'Stage Customer Service', 'company': 'Adecco Italia', 'place': 'Segrate, Lombardia\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Receptionist - lingua tedesca', 'company': 'Adecco Italia', 'place': 'Chioggia, Veneto'}, {'extra': 'nuova offerta', 'work': 'Accettatore clienti in officina', 'company': 'Adecco Italia', 'place': 'Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'OPERATORE CUSTOMER SERVICE', 'company': 'OpenjobMetis', 'place': 'Catanzaro, Calabria\n 5 luoghi'}, {'extra': 'nuova offerta', 'work': 'Back Office - L.68/99 Lucca', 'company': 'Adecco Italia', 'place': 'Lucca, Toscana\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'addetto/a call center - categoria protetta legge 68/99', 'company': 'Randstad', 'place': 'Calderara di Reno, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatore call center outbound', 'company': 'GIERRE CONTACT', 'place': 'Da remoto in Brindisi, Puglia\n 1 luogo'}, {'extra': 'nuova offerta', 'work': 'Smartworking Call Center Teleselling', 'company': 'Apophis s.r.l.', 'place': 'Da remoto in Roma, Lazio'}, {'extra': 'nuova offerta', 'work': 'Operatore telefonico da remoto', 'company': 'SGM DISTRIBUZIONE SRL', 'place': 'Da remoto in Torino, Piemonte'}, {'extra': 'nuova offerta', 'work': 'CALL CENTER MANAGER PROVINCIA DI MILANO', 'company': 'R1S S.r.l.', 'place': 'Cinisello Balsamo, Lombardia'}, {'extra': 'nuova offerta', 'work': 'operatore call center inbound e gestione canali digital', 'company': 'Msc srl', 'place': 'Reggio Emilia, Emilia-Romagna'}, {'extra': 'nuova offerta', 'work': 'Operatori Call Center Sede Triggiano e Sede Bari fisso 650', 'company': 'Apophis s.r.l.', 'place': 'Bari, Puglia'}, {'extra': 
'nuova offerta', 'work': 'OPERATORE CALL CENTER PART-TIME', 'company': 'L&C', 'place': 'Roma, Lazio'}]
15
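A note on robustness: the loop indexes `search1[0]` and the other result lists directly, so a job card that lacks one of those elements would raise an `IndexError`. Whether such cards occur on Indeed is an assumption here (for example, a listing without the 'nuova offerta' label); none appear in the output above. A minimal sketch of a guard, using a hypothetical helper `first_text` and `SimpleNamespace` objects as stand-ins for Selenium elements:

```python
from types import SimpleNamespace


def first_text(elements, default=""):
    """Return the stripped text of the first element, or `default` if the list is empty."""
    return elements[0].text.strip() if elements else default


# Stand-ins for the lists returned by find_elements (hypothetical data).
found = [SimpleNamespace(text=" Adecco Italia ")]
missing = []

print(repr(first_text(found)))
print(repr(first_text(missing)))
```

With such a helper, a line like `dictio["extra"] = first_text(search1)` would store an empty string instead of stopping the whole run on the first incomplete card.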