I am working on a browser automation project that performs some browser tasks in parallel. The idea is to:
- open four browsers
- do some tasks
- wait for all browsers to finish with the tasks before we close all browsers
Here's a simple web driver function for demo purposes.
# For initializing webdriver
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

def initialize_driver(starting_url: str = 'https://www.google.com/'):
    ''' Open a webdriver and go to Google
    '''
    # Webdriver option(s): keep webdriver opened
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)
    # Initialize webdriver
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options)
    # Open website; wait until fully loaded
    driver.get(starting_url)
    driver.implicitly_wait(10)
    time.sleep(1)
    return driver
Using this function, I can now create four jobs that will run in parallel using multiprocessing.
# Import package
import multiprocessing as mp

# List of workers
workers = []
# Run in parallel
for _ in range(4):
    worker = mp.Process(target=phm2.worker_bot_test)
    worker.start()
    workers.append(worker)
for worker in workers:
    worker.join()
This covers the first two points, but as far as I know, we can only close one webdriver at a time using driver.close(). Is there a way to close them all at once? I tried creating a global list of webdrivers, appending each webdriver to it at the end of the function, and then closing them one by one. But for some reason, it isn't working.
# I added drivers.append(driver) at the end of the function from earlier
# This will now be a global variable to store the list of drivers
drivers = []
# Insert multiprocessing code here...
# Close all drivers
for driver in drivers:
    driver.close()
What could I try to achieve the last step? I've seen that we can tweak the Process class to include return values (having return values would be a big help), but I'd rather avoid that if possible since it's kind of complex.
CodePudding user response:
Each webdriver object is a completely independent object instance.
Calling get() on one webdriver object has no influence on any other webdriver object. Similarly, calling quit() or close() on one webdriver object has no influence on any other webdriver object.
So the only way to close ALL your webdriver sessions is to keep all the webdriver objects in some structure, such as a list.
When you need to close all the sessions, iterate over that list and call driver.quit() on each object in it.
BTW, to cleanly end a session you should use the quit() method, not close(): close() only closes the current window, while quit() ends the whole session and shuts down the driver process.
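As a rough sketch of the list approach, assuming the workers run as threads in one process: the helper below collects every driver in a shared list and calls quit() on each once all workers have finished. The names run_and_quit_all, make_driver, task and StubDriver are made up for illustration; in real use make_driver would be something like initialize_driver from the question. Note that this only works with threads. With mp.Process, each child process gets its own copy of the list, so anything appended in a child never shows up in the parent, which is most likely why the list in the question stayed empty.

```python
import threading


def run_and_quit_all(make_driver, task, count=4):
    """Run `task` on `count` drivers in parallel threads, then quit them all.

    Returns the number of drivers whose quit() call succeeded.
    """
    drivers = []             # shared by all threads of this process
    lock = threading.Lock()  # guards appends from concurrent workers

    def worker():
        driver = make_driver()
        with lock:
            drivers.append(driver)
        task(driver)

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every browser task to finish

    closed = 0
    for driver in drivers:
        try:
            driver.quit()
            closed += 1
        except Exception as exc:  # keep going if one session is already gone
            print(f"quit() failed: {exc!r}")
    return closed


class StubDriver:
    """Hypothetical stand-in for a real webdriver; it opens no browser."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


if __name__ == "__main__":
    # With Selenium installed this could be, for example:
    # run_and_quit_all(initialize_driver, lambda d: print(len(d.page_source)))
    print(run_and_quit_all(StubDriver, lambda driver: None))
```

The lock is a small precaution so that the list is only ever modified by one worker at a time.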
CodePudding user response:
I would first observe that since the Selenium driver is already running as a child process, you only really need multithreading. I am assuming that any work done by your threads after a web page and its elements have been retrieved is not particularly CPU-intensive. If this is not the case, you can always create a multiprocessing pool and pass it to the worker_bot_test worker function for executing any CPU-intensive operations in parallel.
By using threads, we can create a class that creates the driver and has a __del__ finalizer that "quits" the driver when the class instance is garbage collected. We keep a reference to that class instance in thread-local storage so that the finalizer is only called when the thread terminates and its thread-local storage is garbage collected. To ensure this garbage collection, we can explicitly call gc.collect after the child threads terminate. If we were using multiprocessing instead of multithreading, this call to gc.collect would have no effect because it only garbage collects the current process.
# For initializing webdriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import threading

class ChromeDriver:
    def __init__(self, starting_url):
        chrome_options = Options()
        chrome_options.add_experimental_option("detach", True)
        # Not a bad option to add:
        #chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])
        # If we don't need to see the browsers:
        #chrome_options.add_argument("headless")
        # Initialize webdriver
        self.driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()),
            options=chrome_options)
        # Open website; wait until fully loaded
        self.driver.get(starting_url)
        self.driver.implicitly_wait(10)
        # What is the purpose of the following line?
        #time.sleep(1)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

threadLocal = threading.local()

def initialize_driver(starting_url: str = 'https://www.google.com/'):
    chrome_driver = ChromeDriver(starting_url)
    # Make sure there is a reference to the ChromeDriver instance so that
    # it is not prematurely finalized:
    threadLocal.driver = chrome_driver
    return chrome_driver.driver

def worker_bot_test():
    driver = initialize_driver()
    print(len(driver.page_source))

if __name__ == '__main__':
    # List of workers
    workers = []
    # Run in parallel
    for _ in range(4):
        worker = threading.Thread(target=worker_bot_test)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
    # Ensure finalizers are executed:
    import gc
    gc.collect()
Prints:
...
163036
163050
163183
165486
The driver has been "quitted".
The driver has been "quitted".
The driver has been "quitted".
The driver has been "quitted".
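If you want to check the finalizer behavior without launching any browsers, here is a minimal sketch of the same mechanism. FakeDriver, worker and run_workers are made-up names for illustration; FakeDriver stands in for the ChromeDriver class, and its __del__ appends to a list instead of calling quit(). It assumes CPython, where thread-local data is released when its thread terminates.

```python
import gc
import threading


class FakeDriver:
    """Hypothetical stand-in for the ChromeDriver wrapper; it opens no browser."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __del__(self):
        # Plays the role of self.driver.quit() in the real wrapper.
        self.log.append(self.name)


thread_local = threading.local()


def worker(name, log):
    # The only reference to the FakeDriver lives in thread-local storage,
    # so it becomes collectable once this thread has terminated.
    thread_local.driver = FakeDriver(name, log)


def run_workers(count=4):
    """Start `count` threads and return the names of the finalized drivers."""
    log = []
    threads = [
        threading.Thread(target=worker, args=(f"driver-{i}", log))
        for i in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Ensure finalizers are executed, as in the code above:
    gc.collect()
    return sorted(log)


if __name__ == "__main__":
    print(run_workers())
```

Keep in mind that Python does not guarantee that __del__ runs for objects still alive at interpreter exit, so if the cleanup must happen, calling quit() explicitly in a try/finally block is the safer choice.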