How do I read through an HTTP response and select a particular part of it?

I am trying to scrape a website by requesting the URL and using Beautiful Soup to search for all i and a tags.

I am struggling to figure out how to make the script check that both tags are present before writing the result to a txt file and then reading it back to extract the URL.

I am ultimately creating a pdf downloader that will crawl the website, find the links, open them up and download the pdf files on the final page.

When I run the line if FILETYPE in file_link: I get this error:

if FILETYPE in file_link:
TypeError: argument of type 'NoneType' is not iterable

How can I rectify this?

Here is my code:

from bs4 import BeautifulSoup as bs
import requests
import constants as c

URL = c.url
DOMAIN = c.domain
FILETYPE = '.html'


def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')


for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(f"{DOMAIN}{file_link}")

CodePudding user response:

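I did some experimenting on my own and got it to work. I think your issue is that an <a> tag in the page you are trying to scrape doesn't have an href attribute. For such a tag, link.get('href') returns None, and the test FILETYPE in file_link then raises the TypeError because None cannot be searched like a string.

I changed your code a little bit to handle that case: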
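To see the cause in isolation, here is a minimal sketch that needs no network access. The inline HTML is a hypothetical stand-in for the real page, not taken from it. It shows that get('href') returns None for a tag without that attribute, that the membership test fails on None, and that passing href=True to find_all is one way to skip such tags up front.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the real page: the first <a> is a
# named anchor with no href attribute, the second is an ordinary link.
html = '<a name="top">Top</a><a href="docs/index.html">Docs</a>'
soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None when the attribute is missing.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # [None, 'docs/index.html']

# The same operation as `FILETYPE in file_link`, applied to None.
raised = False
try:
    ".html" in hrefs[0]
except TypeError:
    raised = True
print(raised)  # True

# href=True restricts the search to <a> tags that have an href attribute.
with_href = [link["href"] for link in soup.find_all("a", href=True)]
print(with_href)  # ['docs/index.html']
```

Filtering with href=True and checking for None are equivalent here; which one reads better is a matter of taste.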

from bs4 import BeautifulSoup as bs
import requests

URL = "https://www.vatican.va/offices/papal_docs_list.html"
FILETYPE = '.html'


def get_soup(url):
    return bs(requests.get(url).text, 'html.parser')


for link in get_soup(URL).find_all('a'):
    # link.get("href") returns None when the <a> tag has no href attribute
    if (file_link := link.get("href")) is not None:
        if FILETYPE in file_link:
            print(file_link)

In the for loop, I assign file_link with the walrus operator (:=, available since Python 3.8) so that I can check it is not None in the same expression. The check for ".html" only runs when file_link holds an actual value.
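If you later want full URLs for the downloader, the same None guard can be pulled into a small helper. This is only a sketch under assumptions: the function name matching_links, the sample values, and the example.com addresses are all made up for illustration. It uses urljoin from the standard library to resolve relative links against the page URL, instead of concatenating a domain string by hand.

```python
from typing import Iterable, List, Optional
from urllib.parse import urljoin


def matching_links(
    hrefs: Iterable[Optional[str]], base_url: str, filetype: str
) -> List[str]:
    """Return absolute URLs for the href values that contain `filetype`."""
    results = []
    for href in hrefs:
        # Skip tags without an href (None) before doing the substring test.
        if href is not None and filetype in href:
            # urljoin resolves relative paths against the page URL and
            # leaves already-absolute URLs unchanged.
            results.append(urljoin(base_url, href))
    return results


# Hypothetical values, shaped like what link.get("href") could return.
sample = [None, "/content/a.html", "b.pdf", "https://example.org/c.html"]
links = matching_links(sample, "https://example.com/docs/list.html", ".html")
print(links)
# ['https://example.com/content/a.html', 'https://example.org/c.html']
```

In the scraper, the first argument would be something like (link.get("href") for link in get_soup(URL).find_all("a")).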

Hope this helps