I found some code online that downloads all the PDFs found at a URL, and it works, but it fails on the website I need it for. I'm trying to download the PDF of the menu for each day of the week, and I can't figure out how to narrow it down to only those 7 PDF files.
from bs4 import BeautifulSoup
import requests
url = "https://calbaptist.edu/dining/alumni-dining-commons"
# Requests URL and get response object
response = requests.get(url)
# Parse text obtained
soup = BeautifulSoup(response.text, 'html.parser')
# Find all hyperlinks present on webpage
links = soup.find_all('a')
i = 0
# From all links check for pdf link and
# if present download file
for link in links:
    if ".pdf" in link.get('href', []):
        i += 1
        print("Downloading file: ", i)
        # Get response object for link
        response = requests.get(link.get('href'))
        # Write content in pdf file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
I tried to change the if-statement to look for /dining/menus-and-hours/adc-menus/ instead of .pdf. This gave me an error on the line that gets the response object for the link.
CodePudding user response:
Check the href values: they are relative, not absolute, so you have to prepend the base URL.
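For example, a minimal sketch of turning a relative href into an absolute URL with urllib.parse.urljoin (the base URL is taken from the question; the filename in the href is hypothetical, and urljoin is just one way to do the prepending):

from urllib.parse import urljoin

base = "https://calbaptist.edu"
href = "/dining/menus-and-hours/adc-menus/monday.pdf"  # hypothetical relative href
# urljoin handles both relative and already-absolute hrefs
print(urljoin(base, href))  # https://calbaptist.edu/dining/menus-and-hours/adc-menus/monday.pdf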
You could also select your elements more specifically with a CSS selector, e.g. one whose href contains the menu path:
soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]')
or one whose href ends with .pdf:
soup.select('a[href$=".pdf"]')
You may also want to take a look at enumerate():
for i,e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'),start=1):
You can also check the Content-Type of the response headers:
requests.get('https://calbaptist.edu' + e.get('href')).headers['Content-Type']
Example
from bs4 import BeautifulSoup
import requests
url = "https://calbaptist.edu/dining/alumni-dining-commons"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for i, e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'), start=1):
    r = requests.get('https://calbaptist.edu' + e.get('href'))
    if r.headers['Content-Type'] == 'application/pdf':
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(r.content)
        pdf.close()
        print("File ", i, " downloaded")