I'm currently coding a Python project which needs to do the following:
- the user inputs multiple links to different sites
- the script scrapes information from these sites and writes the output to a .txt file
The problem I have is that if a site can't be reached (for example a random link: oflexertzue.com), the whole script stops and I have to restart it.
This is the error message I get if a site can't be reached:
Failed to establish a new connection: [Errno 11001] getaddrinfo failed
I was trying to find a way to skip to the next link, or to catch the exception and write 'exception' into the text file. I have also tried using try/except, but I had no luck with it.
This is the code I currently have for my script:
from time import sleep
import requests
from bs4 import BeautifulSoup
http = 'http://'
input_1 = input("Link: ").split(',')
link = [http + site for site in input_1]
open("output.txt", 'w').close()
for url in link:
    sleep(1)
    website = requests.get(url)
    results = BeautifulSoup(website.content, 'html.parser')
    all_div = results.find_all("div", class_="rte", limit=1)
    # [information I want to scrape from a site]
    # [...]
    file = open("output.txt", 'a', encoding="utf-8")
    file.write("\n")
    file.write(' ' + url + ' ')
    file.write(output)
    file.write("\n")
    file.close()
CodePudding user response:
Simply put, let a context manager take care of the I/O operations, and place the try block inside the loop. I also added foo(), where you can add further operations on all_div:
def foo(x):
    return ''.join([f"{s.get_text()}\n" for s in x])

def scrape(input_links) -> None:
    with open("output.txt", 'a', encoding="utf-8") as file:
        for url in input_links:
            try:
                sleep(1)
                website = requests.get(url)
                results = BeautifulSoup(website.content, 'html.parser')
                all_div = results.find_all("div", class_="rte", limit=1)
                output = foo(all_div)
            except Exception as ex:
                # the request (or parsing) failed, so log the error and move on
                file.write(f"\n>>>>>>>>>>>> {ex}\n")
            else:
                file.write(f"\n {url} \n{output}\n")

scrape(link)
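A possible refinement, as a sketch rather than part of the original answer: catching a bare Exception also swallows unrelated bugs, so you may prefer to catch requests.exceptions.RequestException (the base class for connection errors, timeouts, and invalid URLs) and pass a timeout to requests.get so a hanging site can't stall the loop. The name scrape_safe below is just illustrative, and it reuses the foo() helper defined above:

def scrape_safe(input_links) -> None:
    with open("output.txt", 'a', encoding="utf-8") as file:
        for url in input_links:
            sleep(1)
            try:
                # fail after 10 seconds instead of hanging forever
                website = requests.get(url, timeout=10)
                website.raise_for_status()  # treat HTTP 4xx/5xx responses as errors too
            except requests.exceptions.RequestException as ex:
                file.write(f"\n>>>>>>>>>>>> {url}: {ex}\n")
                continue  # skip to the next link
            results = BeautifulSoup(website.content, 'html.parser')
            all_div = results.find_all("div", class_="rte", limit=1)
            file.write(f"\n {url} \n{foo(all_div)}\n")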