I am trying to scrape a single data point from a list of URLs to dynamically loaded sites. I implemented a scraper with Selenium, but it is too slow. I tried Scrapy, but realized it does not work with dynamically loaded sites. I have seen documentation on using Splash with Scrapy, but that seems to cover the case where Splash renders one dynamic site and Scrapy parses the data from it; I have a huge list of URLs. I am considering multiprocessing but am unsure where to get started, or whether it would work well with Selenium.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_cost(url):
    driver.get(url)
    try:
        element = WebDriverWait(driver, 4).until(
            EC.presence_of_element_located((By.XPATH, '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'))
        )
        cost = element.get_attribute('textContent')
    except TimeoutException:
        cost = "-"
    finally:
        driver.quit()
    return cost
This function, given a URL, grabs the cheapest flight cost on the page. I am very new to web scraping, so I would appreciate some advice on the best way to move forward.
CodePudding user response:
This script uses threading (instead of multiprocessing) to open multiple independent windows (instances) of the browser. This means that the code contained in the function get_cost runs simultaneously in each window. Since each thread opens a new window, if you have many URLs you should open only a small number at a time, for example 10 URLs at a time, otherwise the computer may freeze.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import threading

def get_cost(url, costs):
    driver = webdriver.Chrome()  # or webdriver.Firefox(), etc. -- each thread gets its own driver
    driver.get(url)
    try:
        element = WebDriverWait(driver, 4).until(
            EC.presence_of_element_located((By.XPATH, '/html/body/c-wiz[2]/div/div[2]/c-wiz/div/c-wiz/c-wiz/div[2]/div[2]/ul[1]/li[1]/div/div[2]/div/div[9]/div[2]/span'))
        )
        cost = element.get_attribute('textContent')
    except TimeoutException:
        cost = "-"
    finally:
        driver.quit()
    costs.append(cost)
thread_list = []
costs = []
urls = ['...', '...', '...']  # each one is opened in a separate browser window

for idx, url in enumerate(urls):
    t = threading.Thread(name=f'Thread {idx}', target=get_cost, args=(url, costs))
    t.start()
    print(t.name + ' started')
    thread_list.append(t)

# wait for all threads to complete
for thread in thread_list:
    thread.join()

print(costs)
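If you have many URLs, instead of batching threads by hand you can cap concurrency with concurrent.futures.ThreadPoolExecutor: max_workers limits how many browser windows are open at once, and executor.map returns results in the same order as the input URLs (with the costs.append approach above, results arrive in completion order, not input order). A minimal sketch of the pattern, using a placeholder get_cost in place of the real Selenium function, and example URLs that are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the Selenium-based get_cost above: in the real script this
# would create a driver, load the url, scrape the price, and return it.
def get_cost(url):
    return f"cost-of-{url}"  # placeholder result

# Hypothetical URL list for illustration
urls = [f"https://example.com/flight/{i}" for i in range(25)]

# max_workers caps how many get_cost calls (browser windows) run at once,
# so the computer is never asked to open all 25 windows simultaneously.
with ThreadPoolExecutor(max_workers=10) as executor:
    costs = list(executor.map(get_cost, urls))

# costs[i] corresponds to urls[i]
print(len(costs))
```

The executor queues the remaining URLs and starts a new one each time a worker finishes, which replaces the manual "10 at a time" bookkeeping.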