When using Pool.map from integrated Python's multiprocessing, program works slower and slower-CodePudding

Sample of code that uses Pool:

from multiprocessing import Pool
Pool(processes=6).map(some_func, array)

After few iterations the program slows down and finally it becomes even slower than without multiprocessing. Maybe the problem is that the function related to Selenium? Here is full code:

# libraries
import os
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from multiprocessing import Pool

# Необходимые переменные
url = "https://eldorado.ua/"
directory = os.path.dirname(os.path.realpath(__file__))
env_path = directory   "\chromedriver"
chromedriver_path = env_path   "\chromedriver.exe"

dict1 = {"Смартфоны и телефоны": "https://eldorado.ua/node/c1038944/",
         "Телевизоры и аудиотехника": "https://eldorado.ua/node/c1038957/",
         "Ноутбуки, ПК и Планшеты": "https://eldorado.ua/node/c1038958/",
         "Техника для кухни": "https://eldorado.ua/node/c1088594/",
         "Техника для дома": "https://eldorado.ua/node/c1088603/",
         "Игровая зона": "https://eldorado.ua/node/c1285101/",
         "Гаджеты и аксесуары": "https://eldorado.ua/node/c1215257/",
         "Посуда": "https://eldorado.ua/node/c1039055/",
         "Фото и видео": "https://eldorado.ua/node/c1038960/",
         "Красота и здоровье": "https://eldorado.ua/node/c1178596/",
         "Авто и инструменты": "https://eldorado.ua/node/c1284654/",
         "Спорт и туризм": "https://eldorado.ua/node/c1218544/",
         "Товары для дома и сада": "https://eldorado.ua/node/c1285161/",
         "Товары для детей": "https://eldorado.ua/node/c1085100/"}


def openChrome_headless(url1, name):
    options = webdriver.ChromeOptions()
    options.headless = True
    options.add_experimental_option("excludeSwitches", ['enable-automation'])
    options.add_argument(
        '--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"')
    driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
    driver.get(url=url1)
    sleep(1)
    try:
        with open(name   ".html", "w", encoding="utf-8") as file:
            file.write(driver.page_source)
    except Exception as ex:
        print(ex)
    finally:
        driver.close()
        driver.quit()


def processing_goods_pages(name):
    for n in os.listdir(f"brand_pages\\{name}"):
        with open(f"{directory}\\brand_pages\\{name}\\{n}", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "lxml")

        if not os.path.exists(f"{directory}\\goods_pages\\{name}\\{n[:-5]}"):
            if not os.path.exists(f"{directory}\\goods_pages\\{name}"):
                os.mkdir(f"{directory}\\goods_pages\\{name}")
            os.mkdir(f"{directory}\\goods_pages\\{name}\\{n[:-5]}")

        links = soup.find_all("header", class_="good-description")
        for li in links:
            ref = url   li.find('a').get('href')
            print(li.text)
            openChrome_headless(ref, f"{directory}\\goods_pages\\{name}\\{n[:-5]}\\{li.text}")


if __name__ == "__main__":
    ar2 = []
    for k, v in dict1.items():
        ar2.append(k)
    Pool(processes=6).map(processing_goods_pages, ar2)

CodePudding user response：

You are creating 6 processes to process 14 URLs -- so far so good. But then each process in the pool in order to process a URL is launching a headless Chrome browser once for each link it reads from a file for that URL. I don't know how many links on the average it processes for each URL and I can't say that opening and closing Chrome so many times is the cause of the eventual slowdown. But it seems to me that if you want a multiprocessing level of 6, then you should never have to have more than 6 Chrome sessions started. To accomplish this, however, takes a bit of code refactoring.

The first thing I would note is that this job could probably just as well use multithreading instead of multiprocessing. There is some CPU-intensive work done by BeautifulSoup and the lxml parser, but I suspect this pales in comparison to launching Chrome 6 times and fetching the URL pages, especially since you have a hard-coded wait of 1 second following the URL fetch (more on this later).

The idea is to store in thread local storage the currently open Chrome driver for each thread in the multithreading pool and to never quit the driver until the end of the program. The logic that was in function openChrome_headless now needs to be moved to a new special function create_driver that can be called by processing_goods_pages to get the current Chrome driver for the current thread (or create one if there isn't one currently). But that means the URL-specific code that had been in openChrome_headlesss now needs to be moved to processing_goods_pages.

Finally, thread local storage is deleted and the garbage collector is run to ensure that the destructor for all the instances of class Driver are run to ensure that all the Chrome driver instances are "quitted."

Since I do not have access to your files, this obviously could not be thoroughly tested, so there could be a spelling error or 10. Good luck.

One further note: Instead of doing a call to sleep(1) following the driver.get(ref) call, you should look into doin instead a call to driver.implicitly_wait(1) followed by a driver call to locate an element whose presence ensures that everything you need on the page for writing out has been load, if such a thing is possible. In that way you are only waiting the minimum time necessary for the links to be present. Of course, if the DOM is not modified subsequent to the initial page load via AJAX calls, there is no need to sleep at all.

import os
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
# Use multithreading instead of multiprocessing
from multiprocessing.pool import ThreadPool
import threading

# Необходимые переменные
url = "https://eldorado.ua/"
directory = os.path.dirname(os.path.realpath(__file__))
env_path = directory   "\chromedriver"
chromedriver_path = env_path   "\chromedriver.exe"

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.headless = True
        options.add_experimental_option("excludeSwitches", ['enable-automation'])
        options.add_argument(
            '--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"')
        self.driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')

threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


dict1 = {"Смартфоны и телефоны": "https://eldorado.ua/node/c1038944/",
         "Телевизоры и аудиотехника": "https://eldorado.ua/node/c1038957/",
         "Ноутбуки, ПК и Планшеты": "https://eldorado.ua/node/c1038958/",
         "Техника для кухни": "https://eldorado.ua/node/c1088594/",
         "Техника для дома": "https://eldorado.ua/node/c1088603/",
         "Игровая зона": "https://eldorado.ua/node/c1285101/",
         "Гаджеты и аксесуары": "https://eldorado.ua/node/c1215257/",
         "Посуда": "https://eldorado.ua/node/c1039055/",
         "Фото и видео": "https://eldorado.ua/node/c1038960/",
         "Красота и здоровье": "https://eldorado.ua/node/c1178596/",
         "Авто и инструменты": "https://eldorado.ua/node/c1284654/",
         "Спорт и туризм": "https://eldorado.ua/node/c1218544/",
         "Товары для дома и сада": "https://eldorado.ua/node/c1285161/",
         "Товары для детей": "https://eldorado.ua/node/c1085100/"}

def processing_goods_pages(name):
    for n in os.listdir(f"brand_pages\\{name}"):
        with open(f"{directory}\\brand_pages\\{name}\\{n}", encoding="utf-8") as file:
            soup = BeautifulSoup(file.read(), "lxml")

        if not os.path.exists(f"{directory}\\goods_pages\\{name}\\{n[:-5]}"):
            if not os.path.exists(f"{directory}\\goods_pages\\{name}"):
                os.mkdir(f"{directory}\\goods_pages\\{name}")
            os.mkdir(f"{directory}\\goods_pages\\{name}\\{n[:-5]}")

        links = soup.find_all("header", class_="good-description")
        driver = create_driver()
        for li in links:
            ref = url   li.find('a').get('href')
            print(li.text)
            driver.get(ref)
            sleep(1)
            name = f"{directory}\\goods_pages\\{name}\\{n[:-5]}\\{li.text}"
            try:
                with open(name   ".html", "w", encoding="utf-8") as file:
                    file.write(driver.page_source)
            except Exception as ex:
                print(ex)

if __name__ == "__main__":
    ThreadPool(processes=6).map(processing_goods_pages, dict1.keys())
    # Quit all the Selenium drivers:
    del threadLocal
    import gc
    gc.collect() # a little extra insurance