Threads Running One After One in Python?


Update1:

If I change the code inside the for loop to:

print('processing new page')
pool.apply_async(time.sleep, (5,))

I see a 5-second delay after every print, so the problem isn't related to the webdriver.

Update2:

Thanks to @user56700, but I'm interested in knowing what I did wrong here and how to fix it without switching from the way I'm using threads.


In python I have the following code:

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
for url in urls:
    try:
        print('processing new page')
        result = parse_page(driver, url) # Visit url via driver, wait for it to load and parse its contents (takes 30 sec per page)
        # Change global variables
    except Exception as e:
        log_warning(str(e))

If I have 10 pages, the above code needs 300 seconds to finish, which is a lot.

I read about threading in Python (https://stackoverflow.com/a/15144765/19500354), so I wanted to use it, but I'm not sure I'm doing it the right way.

Here's my try:

import threading
from multiprocessing.pool import ThreadPool as Pool
G_LOCK = threading.Lock()

driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
pool = Pool(10)
for url in urls:
    try:
        print('processing new page')
        result = pool.apply_async(parse_page, (driver, url,)).get()
        G_LOCK.acquire()
        # Change global variables
        G_LOCK.release()
    except Exception as e:
        log_warning(str(e))

pool.close()
pool.join()

# Here I want to make sure ALL threads have finished working before running the below code

Why is my implementation wrong? Note that I'm using the same driver instance.

I tried to print the time next to "processing new page" and I see:

[10:36:02] processing new page
[10:36:09] processing new page
[10:36:15] processing new page
[10:36:22] processing new page
[10:36:39] processing new page

This means something is wrong, as I would expect about a 1-second difference and nothing more, since all I'm doing is changing global variables.

CodePudding user response:

I just created a simple example to showcase how I would solve it. You need to add your own code, of course.
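Before the example: the reason your loop runs one page at a time is that `pool.apply_async(parse_page, (driver, url,)).get()` calls `.get()` immediately, which blocks until that single task finishes, so each iteration waits for the previous one and the pool never runs anything in parallel. If you want to keep the `multiprocessing.pool` style from your question, submit all tasks first and only then collect the results. Here is a minimal sketch using `time.sleep` as a stand-in for the real `parse_page` work:

```python
import time
from multiprocessing.pool import ThreadPool as Pool

pool = Pool(10)

start = time.time()

# Submit every task first. apply_async returns an AsyncResult
# immediately instead of blocking the loop the way .get() does.
async_results = [pool.apply_async(time.sleep, (2,)) for _ in range(10)]

# Only now wait for the results, after all tasks are in flight.
for r in async_results:
    r.get()

pool.close()
pool.join()

elapsed = time.time() - start
# Ten 2-second tasks finish in roughly 2 seconds, not 20,
# because all ten ran concurrently.
print(f"elapsed: {elapsed:.1f}s")
```

The same submit-first, collect-later pattern applies to your `parse_page` calls: build the list of `AsyncResult` objects in the loop, then call `.get()` on each one afterwards (taking the lock only when updating the globals).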

from concurrent.futures import ThreadPoolExecutor, as_completed
from selenium import webdriver

driver = webdriver.Chrome()
urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
your_data = []

def parse_page(driver, url):
    driver.get(url)
    data = driver.title
    return data

with ThreadPoolExecutor(max_workers=10) as executor:
    results = {executor.submit(parse_page, driver, url) for url in urls}
    for result in as_completed(results):
        your_data.append(result.result())

driver.close()
print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']

If you want, you could use the webdriver as a context manager to avoid having to close it, like this:

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    with ThreadPoolExecutor(max_workers=10) as executor:
        results = {executor.submit(parse_page, driver, url) for url in urls}
        for result in as_completed(results):
            your_data.append(result.result())

print(your_data)

Example using the multiprocessing.pool library:

from selenium import webdriver
from multiprocessing.pool import ThreadPool

with webdriver.Chrome() as driver:
    urls = ["https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org", "https://www.wikipedia.org"]
    your_data = []

    def parse_page(driver, url):
        driver.get(url)
        data = driver.title
        return data

    pool = ThreadPool(processes=10)
    results = [pool.apply_async(parse_page, (driver, url)) for url in urls]
    for result in results:
        your_data.append(result.get())

print(your_data)

Result:

['Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia', 'Wikipedia']