Python WebScraping - Sleep oscillate in slow websites-CodePudding

I have a webscraping, but the site I'm using in some days is slow and sometimes not. Using the fixed SLEEP, it gives an error in a few days. How to fix this? I use SLEEP in the intervals of the tasks that I have placed, because the site is sometimes slow and does not return the result giving me an error.

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox import options
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import Select
import pandas as pd
import json
from time import sleep 




options = Options()
options.headless = True

navegador = webdriver.Firefox(options = options)

link = '****************************'
navegador.get(url = link)
sleep(1)

usuario = navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_txtLogin')
usuario.send_keys('****************************')
sleep(1)

senha = navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_txtSenha')
senha.send_keys('****************************')
sleep(2.5)

botaologin = navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_btnEnviar')
botaologin.click()
sleep(40)

agendamento = navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_TreeView2t8')
agendamento.click()
sleep(2)

selecdia = navegador.find_element(By.CSS_SELECTOR, "a[title='06 de dezembro']")
selecdia.click()
sleep(2)

selecterminal = navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa')
selecterminal.click()
sleep(1)

select = Select(navegador.find_element(by=By.ID, value='ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa'))
select.select_by_index(1)
sleep(10)

buscalink = navegador.find_elements(by=By.XPATH, value='//*[@id="divScroll"]')

for element in buscalink:
    teste3 = element.get_attribute('innerHTML')

soup = BeautifulSoup(teste3, "html.parser")


Vagas = soup.find_all(title="Vaga disponível.")

print(Vagas)

temp=[]
for i in Vagas:
    on_click = i.get('onclick')
    temp.append(on_click)


df = pd.DataFrame(temp)
df.to_csv('test.csv', mode='a', header=False, index=False)

It returns an error because the page does not load in time and it cannot get the data, but this time is variable

CodePudding user response：

Instead of all these hardcoded sleeps you need to use WebDriverWait expected_conditions explicit waits.
With it you can set some timeout period so Selenium will poll the page periodically until the expected condition is fulfilled.
For example if you need to click a button you will wait for that element clickability. Once this condition is found Selenium will return you that element and you will be able to click it.
This will reduce all the redundant delays on the one hand and will keep waiting until the condition is matched on the other hand (until it is inside the defined timeout).
So, your code can be modified as following:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

#-----
wait = WebDriverWait(navegador, 30)

navegador.get(link)

wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtLogin"))).send_keys('****************************')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtSenha"))).send_keys('****************************')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_btnEnviar"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_TreeView2t8"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[title='06 de dezembro']"))).click()

etc.