I am scraping pages of the Italian website that publishes new laws (the Gazzetta Ufficiale) to save the final page, which holds the law text.
I have a loop that builds a list of the pages to download, and I am attaching a fully working code sample which shows the problem I'm running into (the sample is not looped; I am just doing two "gets").
What is the best way to handle the rare page which does not show the "Visualizza" (show) button but goes straight to the desired full text?
I hope the code is pretty self-explanatory and well commented. Thank you in advance, and a super happy 2022!
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome("/Users/bob/Documents/work/scraper/scrape_gu/chromedriver")
# showing the "normal" behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario"
)
# this page has a "Visualizza" button, find it and click it.
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the "normal" result with the "Visualizza" button
bottoni[0].click() # now click it and this shows the desired final webpage
time.sleep(5) # just to see the "normal" desired result
# but unfortunately some pages go directly to the end result WITHOUT the "Visualizza" button.
# as an example see the following get
# showing the exceptional behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario"
)  # get a law page
time.sleep(5)  # as you can see we are now on the final desired full page WITHOUT the Visualizza button
# hence the following code, identical to that above, will fail and time out
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the result
bottoni[0].click() # and this shows the desired final webpage
# and the program abends with the following message
# File "/Users/bob/Documents/work/scraper/scrape_gu/temp.py", line 33, in <module>
# bottoni = WebDriverWait(driver, 10).until(
# File "/Users/bob/opt/miniconda3/envs/scraping/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
# raise TimeoutException(message, screen, stacktrace)
# selenium.common.exceptions.TimeoutException: Message:
CodePudding user response:
Catch the exception with a try/except block. If there is no button, extract the text directly (see Handling Exceptions in the Python docs).
...
from selenium.common.exceptions import TimeoutException  # needed for the except clause below

urls = [
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario',
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario'
]

data = []

for url in urls:
    driver.get(url)
    try:
        # wait briefly for a clickable "Visualizza" button
        bottoni = WebDriverWait(driver, 1).until(
            EC.element_to_be_clickable(
                (By.XPATH, '//input[@value="Visualizza"]')
            )
        )
        bottoni.click()
    except TimeoutException:
        # no button: we are already on the full-text page
        print('no bottoni -')
    finally:
        # in both cases, grab the text of the page we ended up on
        data.append(driver.find_element(By.XPATH, '//body').text)

driver.close()
print(data)
...
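Note that driver.close() sits after the loop, so the browser stays open for every URL, and the finally block appends the page text whether or not the button was found.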
CodePudding user response:
First, using selenium for this task is overkill.
You'd be able to do the same thing using requests or aiohttp coupled with beautifulsoup, and that would be much faster and easier to code.
Now to get back to your question, there are a few solutions.
The simplest would be:
- Catch the timeout exception: if the button isn't found, then go straight to parsing the law.
- Check if the button is present:
len(driver.find_elements(By.ID, "corpo_export")) > 0
before either clicking on it or parsing the web page (a full loop using this check is sketched below).
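A minimal sketch of that presence check, reusing the urls list and data accumulator from the previous answer, and assuming the button is part of the server-rendered HTML once the page has loaded:
for url in urls:
    driver.get(url)
    # find_elements returns an empty list (no exception) when nothing matches
    bottoni = driver.find_elements(By.XPATH, '//input[@value="Visualizza"]')
    if bottoni:
        bottoni[0].click()  # button present: click through to the full text
    data.append(driver.find_element(By.XPATH, '//body').text)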
But then again, you'd have a much easier time getting rid of selenium and using beautifulsoup instead.
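For illustration, a minimal requests + BeautifulSoup sketch. The form handling is an assumption about how the "Visualizza" button works (an input inside a form whose fields we replay), so verify it against the actual page markup before relying on it; fetch_law_text is a hypothetical helper, not code from the question:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_law_text(url):
    session = requests.Session()
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    button = soup.find("input", value="Visualizza")
    form = button.find_parent("form") if button else None
    if form is not None:
        # Assumption: the button submits its enclosing form; replay its fields.
        payload = {
            inp.get("name"): inp.get("value", "")
            for inp in form.find_all("input")
            if inp.get("name")
        }
        action = urljoin(url, form.get("action", ""))
        if form.get("method", "get").lower() == "post":
            resp = session.post(action, data=payload)
        else:
            resp = session.get(action, params=payload)
        soup = BeautifulSoup(resp.text, "html.parser")
    # either way, we now hold the full-text page
    return soup.get_text()

for url in urls:  # the urls list from the first answer
    print(fetch_law_text(url)[:200])  # first 200 chars as a sanity check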