How to speed up scraping with Selenium (multiprocessing)


I am trying to scrape a single data point from a list of URLs that point to dynamically loaded sites. I have implemented a scraper with Selenium, but it is too slow. I tried Scrapy, but realized it does not work with dynamically loaded sites. I have seen documentation on Splash with Scrapy, but that seems to cover the case where Splash loads one dynamic site and Scrapy parses the data from that one site; I have a huge list of URLs. I am considering multiprocessing, but I am unsure where to get started or whether it would work well with Selenium.

# assumes these imports and a `driver` created elsewhere in the script
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_cost(url):
    driver.get(url)
    try:
        element = WebDriverWait(driver, 4).until(
            EC.presence_of_element_located((By.XPATH, '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'))
        )
        cost = element.get_attribute('textContent')
    except:
        cost = "-"
    finally:
        driver.quit()
    return cost

This is a function that, given a URL, grabs the cheapest flight cost on the site. I am very new to web scraping, so I would appreciate some advice on the best way to move forward.

CodePudding user response:

This script uses threading (instead of multiprocessing) to open multiple independent windows (instances) of the browser, so the code in the function get_cost runs concurrently, one browser window per thread.

Since each thread opens a new window, if you have many URLs you should only open a small batch at a time, for example 10 URLs at once; otherwise the computer may freeze. (A sketch of one way to enforce that cap follows the script below.)

import threading

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_cost(url, costs):
    # each thread creates its own driver; Chrome here is only an example,
    # any WebDriver will do
    driver = webdriver.Chrome()
    driver.get(url)
    try:
        element = WebDriverWait(driver, 4).until(
            EC.presence_of_element_located((By.XPATH, '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'))
        )
        cost = element.get_attribute('textContent')
    except Exception:
        cost = "-"
    finally:
        driver.quit()
    costs.append(cost)  # list.append is thread-safe in CPython

thread_list = []
costs = []
urls = ['...', '...', '...']  # each one is opened in a separate browser window

for idx, url in enumerate(urls):
    t = threading.Thread(name=f'Thread {idx}', target=get_cost, args=(url, costs))
    t.start()
    print(t.name + ' started')
    thread_list.append(t)

# wait for all threads to complete
for thread in thread_list:
    thread.join()

print(costs)
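
If the list of URLs is long, the standard library's concurrent.futures.ThreadPoolExecutor gives you that cap for free: with max_workers=10, at most 10 browser windows are open at any moment, and a new URL is picked up as soon as a window finishes. A minimal sketch, assuming the same XPath and a Chrome driver as above (the urls list is a placeholder):

import concurrent.futures

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

XPATH = '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'

def get_cost(url):
    driver = webdriver.Chrome()  # one driver per task; swap in your browser of choice
    try:
        driver.get(url)
        element = WebDriverWait(driver, 4).until(
            EC.presence_of_element_located((By.XPATH, XPATH))
        )
        return element.get_attribute('textContent')
    except Exception:
        return "-"
    finally:
        driver.quit()  # always close the window, even on timeout

urls = ['...', '...', '...']

# at most max_workers windows are open at once; map returns results in input order
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    costs = list(executor.map(get_cost, urls))

print(costs)

This also removes the manual start/join bookkeeping and keeps the costs in the same order as the input URLs.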