I'm using concurrent.futures to speed up an IO-bound process (retrieving the H1 heading from a list of URLs found on the Wayback Machine). The code works, but it returns the results in an arbitrary order. I'm looking for a way to get the headings back in the same order as the original URL list.
archive_url_list = [
    'https://web.archive.org/web/20171220002410/http://www.manueldrivingschool.co.uk:80/areas-covered-for-driving-lessons',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20210301102140/https://www.manueldrivingschool.co.uk/contact.php',
    'https://web.archive.org/web/20171220002415/http://www.manueldrivingschool.co.uk:80/contact',
    'https://web.archive.org/web/20160520140505/http://www.manueldrivingschool.co.uk:80/about.php',
    'https://web.archive.org/web/20180102123922/http://www.manueldrivingschool.co.uk:80/about',
]
import waybackpy
import concurrent.futures
from urllib.request import urlopen

from bs4 import BeautifulSoup

CONNECTIONS = 4  # number of worker threads

archive_h1_list = []

def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = (executor.submit(get_archive_h1, h1_url) for h1_url in archive_url_list)
        for future in concurrent.futures.as_completed(f1):
            try:
                data = future.result()
                archive_h1_list.append(data)
            except Exception:
                archive_h1_list.append("No Data Received!")

if __name__ == '__main__':
    concurrent_calls()
    print(archive_h1_list)
I've tried creating a second list and appending the original URL to it as the code runs, in the hope that I can tie each heading back to its URL after the fact, but all I get is an empty list. I'm new to concurrent.futures and hoping there's a standard way to do this.
CodePudding user response:
Instead of a generator with ThreadPoolExecutor.submit, use ThreadPoolExecutor.map to preserve order:
def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        ...
executor.map still runs the calls concurrently, but it yields the results in the same order as the input iterable, so each heading lines up with its URL in archive_url_list. One difference from as_completed: if a worker raises, the exception is re-raised when that result is consumed from the map iterator, so the per-URL error handling has to move.
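For completeness, a minimal sketch of the whole function rewritten around map (not tested against the Wayback Machine; the wrapper name safe_get_archive_h1 is illustrative, and this version returns the list instead of appending to a global):

def safe_get_archive_h1(h1_url):
    # Catch failures per URL so one bad page doesn't abort the whole map;
    # the placeholder keeps the output the same length as the input.
    try:
        return get_archive_h1(h1_url)
    except Exception:
        return "No Data Received!"

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        # map yields results in the same order as archive_url_list
        return list(executor.map(safe_get_archive_h1, archive_url_list))

if __name__ == '__main__':
    archive_h1_list = concurrent_calls()
    print(archive_h1_list)

If you want to keep submit/as_completed (for example, to react to results as soon as each one finishes), the usual way to "tie it back after the fact" is to remember each future's position and fill a pre-sized list, roughly like this:

def concurrent_calls():
    results = ["No Data Received!"] * len(archive_url_list)
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        # Map each future back to the index of the URL it was submitted for.
        future_to_index = {
            executor.submit(get_archive_h1, url): i
            for i, url in enumerate(archive_url_list)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = future.result()
            except Exception:
                pass  # keep the placeholder for URLs that failed
    return results

Either way the ordering problem disappears, because results are placed by input position rather than by completion time.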