I am new to web scraping with Selenium and I am scraping seetickets.us. My scraper works as follows:
- sign in
- search for events
- click on each event
- scrape data
- come back
- click on next event
- repeat
Now the problem is that some events are missing certain elements. For example, this event does not contain a pricing table: https://wl.seetickets.us/event/Beta-Hi-Fi/484490?afflky=WorldCafeLive
but this one does:
https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633
So I have used try/except blocks:
try:
    find element
except:
    return none
But if the element is not found in the try block, it takes 5 seconds to reach the except block because I have set
driver.implicitly_wait(5)
Now, if a page is missing several elements, Selenium takes a very long time to scrape it.
I have thousands of pages to scrape. What can I do to speed up the process?
Thanks
CodePudding user response:
Instead of an implicit wait, use an explicit wait, but apply it only to the main container, so that you wait once for the content to load. For all inner elements, use find_element with no waits.
P.S. It's always better to share your real code instead of pseudo-code
CodePudding user response:
To speed up web scraping with Selenium:
- Remove implicitly_wait() entirely.
- Use WebDriverWait to synchronise the WebDriver instance with the browser, waiting for one of the following element states: presence_of_element_located(), visibility_of_element_located(), or element_to_be_clickable().
Your effective code block will be:
try:
    # wait at most 3 seconds for the element to become visible
    element = WebDriverWait(driver, 3).until(EC.visibility_of_element_located((By.ID, "input")))
    print("Element is visible")
except TimeoutException:
    print("Element is not visible")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
CodePudding user response:
Instead of using an implicit wait and waiting for each individual element, wait only once for the page to load. For example, wait for the h1 tag, whose presence indicates that the main content has been loaded, then proceed with the extraction.
# wait for page load
try:
    pageLoadCheck = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "(//h1)[1]"))).get_attribute("textContent").strip()
    # extract data without any wait once the page is loaded
    # (find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium 4 releases removed)
    try:
        dataOne = driver.find_element(By.XPATH, "((//h1/following-sibling::div)[1]//a[contains(@href,'tel:')])[1]").get_attribute("textContent").strip()
    except Exception:
        # element is missing on this page
        dataOne = ''
except Exception as e:
    print(e)