I found some code online that downloads all the PDFs found at a URL, and it works, but it fails on the website I need it for. I'm trying to download the PDF of the menu for each day of the week, and I can't figure out how to narrow it down to only those 7 PDF files.
from bs4 import BeautifulSoup
import requests
url = "https://calbaptist.edu/dining/alumni-dining-commons"
# Requests URL and get response object
response = requests.get(url)
# Parse text obtained
soup = BeautifulSoup(response.text, 'html.parser')
# Find all hyperlinks present on webpage
links = soup.find_all('a')
i = 0
# From all links check for pdf link and
# if present download file
for link in links:
    if ".pdf" in link.get('href', []):
        i += 1
        print("Downloading file: ", i)
        # Get response object for link
        response = requests.get(link.get('href'))
        # Write content in pdf file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
I tried to change the if-statement to look for /dining/menus-and-hours/adc-menus/ instead of .pdf. This gave me an error on the line that gets the response object for the link.
CodePudding user response:
Check the href values: they are relative, not absolute, so you have to prepend the base URL.
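For example, a minimal sketch of turning a relative href into an absolute URL with urllib.parse.urljoin (the base URL is taken from the question; the filename in the href is hypothetical, and urljoin is just one way to do the prepending):

from urllib.parse import urljoin

base = "https://calbaptist.edu"
href = "/dining/menus-and-hours/adc-menus/monday.pdf"  # hypothetical relative href
# urljoin handles both relative and already-absolute hrefs
print(urljoin(base, href))  # https://calbaptist.edu/dining/menus-and-hours/adc-menus/monday.pdf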
You could also select your elements more specifically with a CSS selector, e.g. one whose href contains the menu path:
soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]')
or one whose href ends with .pdf:
soup.select('a[href$=".pdf"]')
You may also want to take a look at enumerate():
for i,e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'),start=1):
You can also check the Content-Type of the response headers:
requests.get('https://calbaptist.edu' + e.get('href')).headers['Content-Type']
Example
from bs4 import BeautifulSoup
import requests
url = "https://calbaptist.edu/dining/alumni-dining-commons"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for i, e in enumerate(soup.select('a[href*="/dining/menus-and-hours/adc-menus/"]'), start=1):
    r = requests.get('https://calbaptist.edu' + e.get('href'))
    if r.headers['Content-Type'] == 'application/pdf':
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(r.content)
        pdf.close()
        print("File ", i, " downloaded")