How to skip to the next link if a site can't be reached with BeautifulSoup?


I'm currently coding a Python project which needs to do the following:

- the user inputs multiple links to different sites

- the script scrapes information from these sites and writes the output to a .txt file

The problem I have is that if a site can't be reached (for example, a random link such as oflexertzue.com), the whole script stops and I have to restart it.

This is the error message I get if a site can't be reached:

Failed to establish a new connection: [Errno 11001] getaddrinfo failed'

I was trying to find a way to skip to the next link, or to catch the exception and write 'exception' to the text file instead. I have tried using try/except but had no luck with it.
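
Roughly, the behaviour I'm after looks like this (a sketch of the idea, not working code on my end; as far as I know, requests.exceptions.RequestException is the base class that covers connection failures like the one above):

import requests

for url in link:
    try:
        website = requests.get(url)
    except requests.exceptions.RequestException:
        # log the failure and move on instead of crashing the script
        with open("output.txt", 'a', encoding="utf-8") as f:
            f.write("\nexception: " + url + "\n")
        continue  # skip to the next link
    # ... scrape and write the output as before ...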

This is the code I currently have for my script:

from time import sleep
import requests
from bs4 import BeautifulSoup

http = 'http://'

input_1 = input("Link: ").split(',')
link = [http + site for site in input_1]

open("output.txt", 'w').close()

for url in link:
    sleep(1)

    website = requests.get(url)
    results = BeautifulSoup(website.content, 'html.parser')
    all_div = results.find_all("div", class_="rte", limit=1)

    #[information I want to scrape from a site]
    #[...]

    file = open("output.txt", 'a', encoding="utf-8")
    file.write("\n")
    file.write('         ' + ' ' + url + ' ' + '         ')
    file.write(output)
    file.write("\n")
    file.close()

CodePudding user response:

Simply put, let a context manager take care of the file I/O, and place the try block inside the loop so that a failed request only skips that one URL. I also added foo(), where you can add further operations on all_div:

from time import sleep

import requests
from bs4 import BeautifulSoup


def foo(x):
    # Join the text of every matched tag, one per line.
    return ''.join(f"{s.get_text()}\n" for s in x)


def scrape(input_links) -> None:
    # The context manager opens the file once and closes it even if
    # something inside the loop raises.
    with open("output.txt", 'a', encoding="utf-8") as file:
        for url in input_links:
            try:
                sleep(1)
                website = requests.get(url)
                results = BeautifulSoup(website.content, 'html.parser')
                all_div = results.find_all("div", class_="rte", limit=1)
                output = foo(all_div)
            except Exception as ex:
                # Any failure (DNS lookup, timeout, parsing) is written
                # to the file and the loop moves on to the next link.
                file.write(f"\n>>>>>>>>>>>> {ex}\n")
            else:
                # Runs only if no exception was raised.
                file.write(f"\n          {url}          \n{output}\n")


scrape(link)  # 'link' is the list built from the user's input above
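
One optional refinement (not part of the original answer, just a suggestion): requests only raises for network-level failures, so an unreachable domain lands in the except branch, but a 404 page still returns normally. If you want HTTP error statuses treated as failures too, you could call raise_for_status() on the response inside the try block, replacing the requests.get(url) line above:

website = requests.get(url, timeout=10)
website.raise_for_status()  # 4xx/5xx responses now jump to the except branch

The timeout argument is also worth considering, since requests.get() can otherwise block for a long time on a slow or unresponsive host.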