Return list in the original order when using concurrent.futures


I'm using concurrent.futures to speed up an IO-bound process (retrieving the H1 heading from a list of URLs found on the Wayback Machine). The code works, but it returns the results in an arbitrary order. I'm looking for a way to return them in the same order as the original URL list.

archive_url_list = [
    'https://web.archive.org/web/20171220002410/http://www.manueldrivingschool.co.uk:80/areas-covered-for-driving-lessons',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20171220002415/http://www.manueldrivingschool.co.uk:80/contact',
    'https://web.archive.org/web/20160520140505/http://www.manueldrivingschool.co.uk:80/about.php',
    'https://web.archive.org/web/20180102123922/http://www.manueldrivingschool.co.uk:80/about',
]

from urllib.request import urlopen

from bs4 import BeautifulSoup
import concurrent.futures

CONNECTIONS = 4  # worker-thread count; not defined in the original snippet

archive_h1_list = []

def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = (executor.submit(get_archive_h1, h1_url) for h1_url in archive_url_list)
        for future in concurrent.futures.as_completed(f1):
            try:
                data = future.result()
                archive_h1_list.append(data)
            except Exception:
                archive_h1_list.append("No Data Received!")

if __name__ == '__main__':
    concurrent_calls()
    print(archive_h1_list)

I've tried appending each original URL to a second list as the code runs, hoping to tie the results back to the URLs after the fact, but all I get is an empty list. I'm new to concurrent.futures and hoping there's a standard way to do this.

CodePudding user response:

Instead of submitting each task with ThreadPoolExecutor.submit and collecting results with as_completed (which yields futures in completion order), use ThreadPoolExecutor.map, which yields results in the same order as the input iterable:

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        ...

map isn't faster than submit, but it guarantees that results come back in input order, no matter which thread finishes first, and it removes the futures bookkeeping entirely.
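A minimal, self-contained sketch of the map approach. Here fetch_h1 is a dummy stand-in for the asker's get_archive_h1 (no network calls), and safe_fetch is an assumed wrapper: map re-raises a worker's exception at iteration time, so catching per item is what preserves the "No Data Received!" fallback at the right list position.

```python
import concurrent.futures

def fetch_h1(url):
    """Dummy stand-in for get_archive_h1; raises for one input
    to demonstrate per-item error handling."""
    if "bad" in url:
        raise ValueError("no <h1> found")
    return f"H1 of {url}"

def safe_fetch(url):
    # map() re-raises a worker's exception when that result is
    # iterated, which would abort the whole loop; catch it here
    # so the fallback lands at the failing URL's position.
    try:
        return fetch_h1(url)
    except Exception:
        return "No Data Received!"

def concurrent_calls(urls, max_workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the order of `urls`, regardless
        # of which thread finishes first.
        return list(executor.map(safe_fetch, urls))

results = concurrent_calls(["a", "bad", "c"])
# results line up index-for-index with the input list
```

Because the returned list lines up index-for-index with archive_url_list, tying each heading back to its URL is just zip(archive_url_list, results).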
