I'm trying to get a list of URLs from the Wayback Machine using the waybackpy library. The trouble is, it's very slow, and I think it can be sped up with multithreading.
I can see why my code doesn't work (every thread would iterate over the same list inside the function), but I can't figure out how to fix it. Here's my code:
import concurrent.futures
import waybackpy

user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
url_list = ["https://www.google.com", "https://www.facebook.com", "https://www.wikipedia.com", "https://www.walmart.com/", "https://www.ebay.com/", "https://www.amazon.com"]
archive_url_list = []

def get_archive_url(threads):
    counter = 1
    for url in url_list:
        try:
            target_url = waybackpy.Url(url, user_agent)
            newest_archive = target_url.newest()
            archive_url_list.append(newest_archive)
            counter = counter + 1
        except Exception:
            return "Error Retrieving URL from Archive.org"

with concurrent.futures.ThreadPoolExecutor() as executor:
    f1 = executor.submit(get_archive_url, 2)
    print(f1.result())
I can't figure out how to split the list up so that chunks can be assigned to different threads. I've searched and tried many of the top answers on here, but I can't get my head around it or make it work.
CodePudding user response:
You're right, and I also find concurrent.futures a bit hard to get my head around. As you noted, the loop is in the wrong place: the whole list is processed inside a single thread. Instead, submit one task per URL so the pool can work on several at once. You could try something like this:
import concurrent.futures
import waybackpy

CONNECTIONS = 2  # increase this number to run more workers at the same time

user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
url_list = ["https://www.google.com", "https://www.facebook.com", "https://www.wikipedia.com", "https://www.walmart.com/", "https://www.ebay.com/", "https://www.amazon.com"]
archive_url_list = []

def get_archive_url(url):
    # Each call handles a single URL, so the executor can run
    # several of these in parallel.
    target_url = waybackpy.Url(url, user_agent)
    newest_archive = target_url.newest()
    return newest_archive

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        # Submit one future per URL instead of looping inside one thread.
        futures = (executor.submit(get_archive_url, url) for url in url_list)
        # as_completed() yields each future as soon as it finishes.
        for future in concurrent.futures.as_completed(futures):
            try:
                data = future.result().archive_url
            except Exception as e:
                data = ('error', e)
            finally:
                # append() runs here in the main thread, so no lock is needed.
                archive_url_list.append(data)
                print(data)

if __name__ == '__main__':
    concurrent_calls()
    print(archive_url_list)
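If you don't need to handle results as they finish, executor.map is an even shorter way to fan the list out over the pool, and it yields results in the same order as url_list. Here's a minimal sketch of that variant, reusing the same user_agent and url_list from above; note that iterating map()'s results re-raises the first exception and stops the loop, so the error handling moves inside the worker:

import concurrent.futures
import waybackpy

CONNECTIONS = 2
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
url_list = ["https://www.google.com", "https://www.facebook.com", "https://www.wikipedia.com", "https://www.walmart.com/", "https://www.ebay.com/", "https://www.amazon.com"]

def get_archive_url(url):
    # Catch errors here, because map() re-raises them when you iterate
    # its results, which would abort the whole loop.
    try:
        return waybackpy.Url(url, user_agent).newest().archive_url
    except Exception as e:
        return ('error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    # map() fans url_list out across the worker threads and yields
    # results in input order.
    archive_url_list = list(executor.map(get_archive_url, url_list))

print(archive_url_list)

Either way, the speedup comes from having several HTTP requests in flight at once; the as_completed version lets you react to whichever URL finishes first, while map preserves input order.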