I am trying to use Selenium with multiprocessing, where each process is spawned with its own Selenium driver and session (each process is logged in with a different account).
I have a list of URLs to visit. Each URL needs to be visited once by one of the accounts (it doesn't matter which one).
To avoid some nasty global variable management, I tried to initialize each process with a class object using the initializer of multiprocessing.Pool.
After that, I can't figure out how to distribute tasks to the processes, given that the function each process has to use is a method of that class.
Here is a simplified version of what I'm trying to do:
from selenium import webdriver
import multiprocessing

account = [{'account': 1}, {'account': 2}]

class Collector():
    def __init__(self, account):
        self.account = account
        self.driver = webdriver.Chrome()

    def parse(self, item):
        self.driver.get(f"https://books.toscrape.com{item}")

if __name__ == '__main__':
    processes = 1
    pool = multiprocessing.Pool(processes, initializer=Collector, initargs=[account.pop()])
    items = ['/catalogue/a-light-in-the-attic_1000/index.html',
             '/catalogue/tipping-the-velvet_999/index.html']
    pool.map(parse(), items, chunksize=1)
    pool.close()
    pool.join()
The problem comes on the pool.map line: inside the subprocess there is no reference to the instantiated object, so parse cannot be called this way.
Another approach would be to distribute the URLs and call parse during the init, but this would be very nasty.
Is there a way to achieve this?
CodePudding user response:
I'm not entirely certain if this solves your problem.
If you have one account per URL then you could do this:
from selenium import webdriver
from multiprocessing import Pool

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

def process(i, a):
    print(f'Processing account {a}')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    with webdriver.Chrome(options=options) as driver:
        driver.get(f'{baseurl}{i}')

def main():
    with Pool() as pool:
        pool.starmap(process, zip(items, accounts))

if __name__ == '__main__':
    main()
if __name__ == '__main__':
main()
If the number of accounts doesn't match the number of URLs, you have said that it doesn't matter which account GETs which URL. In that case, you could simply select the account to use at random with random.choice().
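A minimal sketch of that random pairing (reusing the items, accounts and process names from the code above; with random.choice the same account may be assigned to several URLs):

```python
import random

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html',
         '/catalogue/soumission_998/index.html']
accounts = [{'account': 1}, {'account': 2}]

# More URLs than accounts: pick an account at random for each URL.
pairs = [(item, random.choice(accounts)) for item in items]
for item, account in pairs:
    print(account, item)
```

The resulting list of (url, account) pairs can then be fed straight into pool.starmap(process, pairs).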
CodePudding user response:
Since Chrome starts its own process, there is really no need to use multiprocessing; multithreading will suffice. I would like to offer a more general solution for the case where you have N URLs to retrieve, where N might be very large, but you would like to limit the number of concurrent Selenium sessions to MAX_DRIVERS, a significantly smaller number. You therefore want to create only one driver session per thread in the pool and reuse it as necessary. The problem then becomes calling quit on each driver when you are finished with the pool, so that you don't leave any Selenium processes running behind.
The following code uses thread-local storage, which is unique to each thread, to hold the current driver instance for each pool thread, and uses a class destructor to call the driver's quit method when the class instance is destroyed:
from selenium import webdriver
from multiprocessing.pool import ThreadPool
import threading

items = ['/catalogue/a-light-in-the-attic_1000/index.html',
         '/catalogue/tipping-the-velvet_999/index.html']
accounts = [{'account': 1}, {'account': 2}]
baseurl = 'https://books.toscrape.com'

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            the_driver = cls()
            threadLocal.the_driver = the_driver
        return the_driver.driver

def process(i, a):
    print(f'Processing account {a}')
    driver = Driver.create_driver()
    driver.get(f'{baseurl}{i}')

def main():
    global threadLocal
    # We never want to create more than MAX_DRIVERS driver instances:
    MAX_DRIVERS = 8  # rather arbitrary
    POOL_SIZE = min(len(accounts), MAX_DRIVERS)
    with ThreadPool(POOL_SIZE) as pool:
        pool.starmap(process, zip(items, accounts))
    # ensure the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect()  # a little extra insurance

if __name__ == '__main__':
    main()
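Returning to the question's original initializer idea: it can also be made to work with processes. The trick is that the initializer stores the instance in a module-level global inside each worker process, and a top-level (and therefore picklable) function forwards each task to that per-process instance. A sketch of the pattern, using a stand-in class instead of a real Selenium driver:

```python
import multiprocessing

worker = None  # one instance per worker process, set by the initializer

class Collector:
    def __init__(self, account):
        self.account = account
        # a real version would create a webdriver.Chrome() here

    def parse(self, item):
        return (self.account, item)

def init_worker(account):
    # Runs once in each worker process; the global is local to that process.
    global worker
    worker = Collector(account)

def parse(item):
    # Top-level function, so it is picklable; delegates to the instance.
    return worker.parse(item)

if __name__ == '__main__':
    accounts = [{'account': 1}]
    items = ['/a', '/b']
    with multiprocessing.Pool(1, initializer=init_worker,
                              initargs=(accounts.pop(),)) as pool:
        results = pool.map(parse, items)
    print(results)  # [({'account': 1}, '/a'), ({'account': 1}, '/b')]
```

With several accounts you would pass the pool size as the first argument, but note that initargs is the same for every worker, so handing a different account to each process needs an extra mechanism (e.g. a shared queue the initializer pops from).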