Speed up web scraping with Selenium

I am new to web scraping with Selenium, and I am scraping seetickets.us. My scraper works as follows:

sign in
search for events
click on each event
scrape data
come back
click on next event
repeat

The problem is that some events do not contain certain elements. For example, this event has no pricing table: https://wl.seetickets.us/event/Beta-Hi-Fi/484490?afflky=WorldCafeLive

but this one does:

https://www.seetickets.us/event/Wake-Up-Daisy-1100AM/477633

So I have used try/except blocks:

try:
   find element 
except:
   return none

But if the element is not found in the try block, it takes 5 seconds to reach the except block, because I have set

driver.implicitly_wait(5)

Now, if a page is missing several elements, Selenium takes a very long time to scrape it.

I have thousands of pages to scrape. What can I do to speed up the process?

Thanks

CodePudding user response：

Instead of an implicit wait, use an explicit wait, but apply it only to the main container, to wait for the content to load. For all inner elements, call find_element with no wait.
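As a minimal sketch of the "no wait for inner elements" part: find_elements returns an empty list when nothing matches, so optional fields can be read without a try/except. The helper name first_text is hypothetical, and the sketch assumes the implicit wait is left at its default of 0, since otherwise a lookup that matches nothing still blocks for the full timeout.

```python
def first_text(root, by, value, default=None):
    """Return the stripped text of the first match under `root`, or `default`.

    `root` can be the driver or an already located container element.
    """
    matches = root.find_elements(by, value)  # empty list when nothing matches
    if not matches:
        return default
    text = matches[0].get_attribute("textContent")
    return text.strip() if text is not None else default
```

Once the main container has been located with an explicit wait, a call such as first_text(container, By.CSS_SELECTOR, ".pricing-table") returns None immediately on events that have no such element. The selector here is made up for illustration.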

P.S. It's always better to share your real code instead of pseudo-code

CodePudding user response：

To speed up web scraping using Selenium:

Remove implicitly_wait() entirely.
Use WebDriverWait to synchronise the WebDriver instance with the browser, waiting for an element state such as presence, visibility, or clickability.

Your effective code block will be:

try:
    element = WebDriverWait(driver, 3).until(EC.visibility_of_element_located((By.ID, "input")))
    print("Element is visible")
except TimeoutException:
    print("Element is not visible")

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
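To illustrate why an explicit wait costs little when the element is there, here is a simplified stand-in for the polling loop that WebDriverWait performs. It is not Selenium's source code: the name wait_until is hypothetical, and it raises the built-in TimeoutError where Selenium raises TimeoutException. The 0.5-second default poll interval matches WebDriverWait's default.

```python
import time


def wait_until(condition, timeout, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

The wait returns as soon as the condition holds, so the full timeout is only spent when the element never appears. That is why it pays to keep such waits to the few elements that are expected on every page.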

CodePudding user response：

Instead of using an implicit wait and waiting for each individual element, wait only for the page to load. For example, wait for the h1 tag, which indicates that the page has loaded, then proceed with extraction.

from selenium.common.exceptions import NoSuchElementException

# Wait for the page to load
try:
    pageLoadCheck = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "(//h1)[1]"))).get_attribute("textContent").strip()

    # Extract data without any wait once the page is loaded
    try:
        dataOne = driver.find_element(By.XPATH, "((//h1/following-sibling::div)[1]//a[contains(@href,'tel:')])[1]").get_attribute("textContent").strip()
    except NoSuchElementException:
        dataOne = ''

except Exception as e:
    print(e)
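A rough estimate shows why the per-element implicit wait dominates the run time. The 5-second timeout comes from the question; the page count and the number of missing elements per page are made-up figures for illustration.

```python
def implicit_wait_overhead(pages, missing_per_page, implicit_wait_s):
    """Seconds spent waiting on elements that never appear."""
    return pages * missing_per_page * implicit_wait_s


# Hypothetical workload: 1000 pages, 3 absent elements on each
overhead = implicit_wait_overhead(pages=1000, missing_per_page=3, implicit_wait_s=5)
print(f"{overhead} s, about {overhead / 3600:.1f} hours")
# prints "15000 s, about 4.2 hours"
```

With a single page-level wait and no implicit wait, lookups for absent elements fail immediately, so this overhead largely disappears.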