I want to check whether there is any content available on more than 500 webpages, using Beautiful Soup. This is the script I wrote. It works, but at some point it stops; if I fix one error, it shows a different one. Below is the code I tried. I just want to be sure each page has a body. I'm also unsure how to handle timeouts; maybe some websites need more time.
method 1:

res = requests.get(full_https_url, timeout=40)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems == '':
    pass
else:
    print('body found')
method 2:

soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems != '':
    print('body found')
else:
    pass
CodePudding user response:
select() returns a list, not a string, so it will always compare not equal to '', whether it found a body or not. Just test whether the result list is non-empty; an empty list is falsy.
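A minimal demonstration of that point (using a hand-written HTML snippet with no body; note that the 'html.parser' backend does not insert missing tags):

```python
import bs4

soup = bs4.BeautifulSoup('<html><head></head></html>', 'html.parser')

print(soup.select('body'))         # [] -- select() returns a list
print(soup.select('body') == '')   # False -- a list never equals a string
print(bool(soup.select('body')))   # False -- empty list is falsy: the test you want
```

So `if soup.select('body'):` does the right thing in both of your methods, whereas comparing against '' is always False.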
Use try/except to catch the timeout error:
try:
    res = requests.get(full_https_url, timeout=40)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body')
    if elems:
        # do stuff
        pass
    else:
        print(f"No body in {full_https_url}")
except requests.exceptions.Timeout:
    print(f"Timeout on {full_https_url}, skipping")
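Since you are checking 500+ pages, you can wrap this in a loop. A minimal sketch, assuming the URLs are collected in a list (the function name `check_pages` and the list parameter are my own, not from your code); a requests.Session reuses connections across requests, and catching the broader RequestException also skips pages that fail for reasons other than a timeout:

```python
import bs4
import requests

def check_pages(urls, timeout=40):
    """Return the subset of urls whose pages contain a <body> element."""
    pages_with_body = []
    with requests.Session() as session:  # reuse TCP connections across requests
        for url in urls:
            try:
                res = session.get(url, timeout=timeout)
            except requests.exceptions.Timeout:
                print(f"Timeout on {url}, skipping")
                continue
            except requests.exceptions.RequestException as e:
                # DNS failures, connection errors, malformed URLs, etc.
                print(f"Request failed for {url}: {e}")
                continue
            soup = bs4.BeautifulSoup(res.text, 'html.parser')
            if soup.select('body'):
                pages_with_body.append(url)
            else:
                print(f"No body in {url}")
    return pages_with_body
```

With this shape, one slow or broken site no longer stops the whole run; it is reported and skipped, and the loop moves on to the next URL.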