I want to download all 259 of the PDFs listed on the page https://www.mdpi.com/search?authors=University of Alabama, Tuscaloosa, for example:
<a href="/1424-8220/21/19/6384/pdf" class="UD_Listings_ArticlePDF" onclick="if (!window.__cfRLUnblockHandlers) return false; ga('send', 'pageview', '/1424-8220/21/19/6384/pdf');" title="Article PDF" data-cf-modified-fa685c2bcda960230d46973e-="">
<i class="material-icons">get_app</i>
</a>
The href only has the part of the URL after the domain, so the full URL is https://mdpi.com/1424-8220/21/19/6384/pdf.
When I run this to download the file:
for link in links:
    if ('/pdf' in link.get('href', [])):
        i = 1
        print("Downloading file: ", i)
        response = requests.get(link.get('href'))
I get this traceback:
requests.exceptions.MissingSchema: Invalid URL '/1424-8220/21/19/6384/pdf': No schema supplied. Perhaps you meant http:///1424-8220/21/19/6384/pdf?
Where do I put the missing part of the URL, "https://mdpi.com"?
CodePudding user response:
requests.get() expects a complete URL string, including the scheme and domain, so you can prepend the missing part with an f-string:
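import requests

i = 0  # running count of downloaded files
for link in links:  # links: the <a> tags already collected in the question
    href = link.get('href', '')  # default to '' so the 'in' test is always on a string
    if '/pdf' in href:
        i += 1  # increment instead of resetting to 1 on every match
        print("Downloading file: ", i)
        response = requests.get(f"https://mdpi.com{href}")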
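If you would rather not build the URL by hand, urljoin from the standard library's urllib.parse can resolve a relative href against a base URL. The following is only a minimal sketch: the helper name absolute_pdf_urls and the BASE_URL constant are made up for illustration, and the base is assumed to be https://mdpi.com as stated in the question.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the question; adjust if the site redirects elsewhere.
BASE_URL = "https://mdpi.com"


def absolute_pdf_urls(hrefs, base=BASE_URL):
    """Return absolute URLs for every href that ends in '/pdf'.

    Hypothetical helper: 'hrefs' is any iterable of href values,
    which may contain None or empty strings for anchors without an href.
    """
    urls = []
    for href in hrefs:
        if href and href.endswith("/pdf"):
            # urljoin resolves the site-relative path against the base URL
            urls.append(urljoin(base, href))
    return urls


# Example with the href from the question plus two values that should be skipped
sample = ["/1424-8220/21/19/6384/pdf", "/about", None]
print(absolute_pdf_urls(sample))
```

Each URL in the returned list could then be passed to requests.get() in the loop above, for instance by calling absolute_pdf_urls(link.get('href') for link in links).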