Update1:
If I change the code inside the for loop to:
print('processing new page')
pool.apply_async(time.sleep, (5,))
I see a 5-second delay after every print, so the problem isn't related to the webdriver.
Update2:
Thanks to @user56700, but I'm interested in knowing what I did wrong here and how to fix it without switching away from the way I'm using threads.
In python I have the following code:
driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
for url in urls:
    try:
        print('processing new page')
        result = parse_page(driver, url)  # Visit url via driver, wait for it to load and parse its contents (takes 30 sec per page)
        # Change global variables
    except Exception as e:
        log_warning(str(e))
If I have 10 pages, the above code needs 300 seconds to finish, which is a lot.
I read about something called threading in Python (https://stackoverflow.com/a/15144765/19500354), so I wanted to use it, but I'm not sure I'm doing it the right way.
Here's my try:
import threading
from multiprocessing.pool import ThreadPool as Pool

G_LOCK = threading.Lock()
driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
pool = Pool(10)
for url in urls:
    try:
        print('processing new page')
        result = pool.apply_async(parse_page, (driver, url,)).get()
        G_LOCK.acquire()
        # Change global variables
        G_LOCK.release()
    except Exception as e:
        log_warning(str(e))
pool.close()
pool.join()
# Here I want to make sure ALL threads have finished working before running the below code
Why is my implementation wrong? Note that I'm using the same driver instance.
I tried to print the time next to "processing new page" and I see:
[10:36:02] processing new page
[10:36:09] processing new page
[10:36:15] processing new page
[10:36:22] processing new page
[10:36:39] processing new page
This means something is wrong, as I would expect a difference of about 1 second, nothing more, since all I'm doing is changing global variables.
CodePudding user response:
I just created a simple example to show how I would solve it. You need to add your own code, of course.
from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver

driver = webdriver.Chrome()
urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
your_data = []

def parse_page(driver, url):
    driver.get(url)
    data = driver.title
    return data

with ThreadPoolExecutor(max_workers=10) as executor:
    results = {executor.submit(parse_page, driver, url) for url in urls}
    for result in as_completed(results):
        your_data.append(result.result())

driver.close()
print(your_data)
Result:
['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
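The question also mentions changing global variables with each result. If you do that from inside the worker threads instead of the main loop, guard the shared state with a threading.Lock. A minimal sketch (the record function is a hypothetical stand-in for the real page parsing):

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

counts = {}              # shared state updated by worker threads
lock = threading.Lock()  # protects counts against concurrent updates

def record(title):
    # Stand-in for per-page work; the shared dict is only
    # touched while holding the lock.
    with lock:
        counts[title] = counts.get(title, 0) + 1
    return title

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(record, t) for t in ["Wikipedia"] * 8]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker

print(counts)  # {'Wikipedia': 8}
```

Using `with lock:` instead of manual acquire/release also guarantees the lock is released if the update raises.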
If you want, you could use the webdriver as a context manager to avoid having to close it explicitly, like this:
with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    with ThreadPoolExecutor(max_workers=10) as executor:
        results = {executor.submit(parse_page, driver, url) for url in urls}
        for result in as_completed(results):
            your_data.append(result.result())

    print(your_data)
Example using the multiprocessing.pool library:
from selenium import webdriver
from multiprocessing.pool import ThreadPool

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    pool = ThreadPool(processes=10)
    results = [pool.apply_async(parse_page, (driver, url)) for url in urls]
    for result in results:
        your_data.append(result.get())

    print(your_data)
Result:
['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']
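As for why the original loop printed one page at a time: calling .get() immediately after apply_async blocks until that single task finishes, so only one task is ever running. Submitting all tasks first and collecting results afterwards (as in the example above) lets them overlap. A minimal sketch using time.sleep as a stand-in for parse_page to contrast the two patterns:

```python
import time
from multiprocessing.pool import ThreadPool

def work(n):
    time.sleep(0.2)  # stand-in for slow page parsing
    return n * n

pool = ThreadPool(4)

# Pattern from the question: .get() right after apply_async blocks
# the loop, so tasks run one at a time (~0.8 s total here).
start = time.monotonic()
serial = [pool.apply_async(work, (n,)).get() for n in range(4)]
serial_time = time.monotonic() - start

# Submit everything first, then collect: all 4 tasks overlap (~0.2 s).
start = time.monotonic()
handles = [pool.apply_async(work, (n,)) for n in range(4)]
parallel = [h.get() for h in handles]
parallel_time = time.monotonic() - start

pool.close()
pool.join()
print(serial, parallel)          # [0, 1, 4, 9] [0, 1, 4, 9]
print(serial_time > parallel_time)  # True
```

The results are identical either way; only the submission pattern changes how much work happens concurrently.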