I'm using concurrent.futures to scrape a list of H1s from webpages and append them to a list called archive_h1_list. The issue is that as soon as concurrent.futures hits an exception, it stops appending to the list.
When I print the resulting list, it stops after the first exception: ['Example Domain', 'Example Domain', 'Exception Error!']
It never goes on to process the last https://www.example.com in the list after hitting the exception.
import concurrent.futures
from urllib.request import urlopen
from bs4 import BeautifulSoup

CONNECTIONS = 8
archive_url_list = ["https://www.example.com", "https://www.example.com", "sdfihaslkhasd", "https://www.example.com"]
archive_h1_list = []

def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        try:
            for future in f1:
                archive_h1_list.append(future)
        except Exception:
            archive_h1_list.append("Exception Error!")
            pass
The expected output should be:
['Example Domain', 'Example Domain', 'Exception Error!', 'Example Domain']
CodePudding user response:
It's because your for loop is inside the try block: when the exception is raised, execution jumps to the except block and the for loop is abandoned, so the remaining URLs are never processed.
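The same behaviour can be reproduced without any threading; once an exception escapes, the loop is done (a self-contained illustration):

results = []
try:
    for value in [1, 2, 0, 4]:
        results.append(10 // value)  # raises ZeroDivisionError at 0
except ZeroDivisionError:
    results.append("Exception Error!")
print(results)  # [10, 5, 'Exception Error!'] -- the final 4 is never processed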
One way to solve it would be to move your for loop outside the try block; however, according to the documentation of Executor.map:
If a func call raises an exception, then that exception will be raised when its value is retrieved from the iterator.
This makes the exception handling pretty nasty outside of your function: the iterator returned by map is a generator, and a generator that raises is finished, so you cannot resume it to collect the remaining results.
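A minimal sketch of that deferred-exception behaviour (the invalid URL here is just for illustration):

import concurrent.futures
from urllib.request import urlopen

def fetch(url):
    return urlopen(url).status

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(fetch, ["https://www.example.com", "not-a-url"])
    # Nothing has been raised yet: map() returns a lazy iterator.
    for status in results:
        print(status)  # the ValueError for "not-a-url" only surfaces here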
So the first solution is to catch the exceptions inside get_archive_h1:
def get_archive_h1(h1_url):
    try:
        html = urlopen(h1_url)
        bsh = BeautifulSoup(html.read(), 'lxml')
        return bsh.h1.text.strip()
    except Exception:
        # Any failure (bad URL, network error, missing h1) becomes a placeholder
        return "Exception Error!"

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        # map() yields results in input order; nothing can raise here anymore
        for result in f1:
            archive_h1_list.append(result)
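With that change, a quick sanity check (using the archive_url_list from the question) yields the expected output:

concurrent_calls()
print(archive_h1_list)
# ['Example Domain', 'Example Domain', 'Exception Error!', 'Example Domain']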
The other solution is to use a different executor method that gives you more control over future resolution, namely Executor.submit:
def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        futures = [executor.submit(get_archive_h1, url) for url in archive_url_list]
        # Iterating the list in submission order keeps results in input order
        for future in futures:
            try:
                # result() re-raises any exception from get_archive_h1
                archive_h1_list.append(future.result())
            except Exception:
                archive_h1_list.append("Exception Error!")
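As a side note beyond the original question: if you don't care about preserving the input order, concurrent.futures.as_completed yields each future as soon as it finishes, which lets you consume results immediately; the trade-off is that the output order no longer matches archive_url_list. A sketch under that assumption:

def concurrent_calls_unordered():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        futures = [executor.submit(get_archive_h1, url) for url in archive_url_list]
        # as_completed yields futures in completion order, not submission order
        for future in concurrent.futures.as_completed(futures):
            try:
                archive_h1_list.append(future.result())
            except Exception:
                archive_h1_list.append("Exception Error!")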