Download PDFs with incomplete URLs in HTML with Beautiful Soup/Requests

Time:09-28

I want to download all 259 PDFs listed on the page https://www.mdpi.com/search?authors=University of Alabama, Tuscaloosa. Each PDF link looks like this:

<a href="/1424-8220/21/19/6384/pdf" class="UD_Listings_ArticlePDF" onclick="if (!window.__cfRLUnblockHandlers) return false; ga('send', 'pageview', '/1424-8220/21/19/6384/pdf');" title="Article PDF" data-cf-modified-fa685c2bcda960230d46973e-="">
<i class="material-icons">get_app</i>
</a>

The href only has the part of the URL after the domain, so the full URL is https://mdpi.com/1424-8220/21/19/6384/pdf.

When I run this to download the file:

for link in links:
    if ('/pdf' in link.get('href', [])):
        i += 1
        print("Downloading file: ", i)
        response = requests.get(link.get('href'))

I get this traceback:

requests.exceptions.MissingSchema: Invalid URL '/1424-8220/21/19/6384/pdf': No schema supplied. Perhaps you meant http:///1424-8220/21/19/6384/pdf?

Where do I put the missing part of the URL, "https://mdpi.com"?

Answer:

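requests.get() takes the URL as a string, and that string must include the scheme and domain. The href holds only the path, so prepend "https://mdpi.com" to it, for example with an f-string: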

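i = 0  # counter for downloaded files
for link in links:
    # link.get('href', '') returns an empty string for anchors without an href
    if '/pdf' in link.get('href', ''):
        i += 1
        print("Downloading file: ", i)
        # Prepend the scheme and domain so requests receives an absolute URL
        response = requests.get(f"https://mdpi.com{link.get('href')}")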
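As an alternative to concatenating strings by hand, the standard library's urllib.parse.urljoin can resolve a relative href against the URL of the page it came from. The following is only a minimal sketch: the helper name resolve_pdf_urls, the BASE_URL constant, and the sample href list are made up for illustration, and it assumes the hrefs have already been pulled out of the anchors (for example with link.get('href')).

```python
from urllib.parse import urljoin

# Assumption: the URL of the page the links were scraped from.
BASE_URL = "https://www.mdpi.com/search"


def resolve_pdf_urls(hrefs, base_url=BASE_URL):
    """Return absolute URLs for every href that ends in '/pdf'."""
    urls = []
    for href in hrefs:
        # Skip anchors without an href and links that are not PDFs.
        if href and href.endswith("/pdf"):
            # urljoin resolves a relative path against base_url and
            # leaves an already absolute URL unchanged.
            urls.append(urljoin(base_url, href))
    return urls


# Hypothetical sample input: one relative PDF link, a missing href,
# a non-PDF link, and an already absolute PDF link.
hrefs = ["/1424-8220/21/19/6384/pdf", None, "/about", "https://example.com/x/pdf"]
pdf_urls = resolve_pdf_urls(hrefs)
print(pdf_urls)
```

Each URL in pdf_urls can then be passed to requests.get() as in the loop above. One advantage of urljoin is that it also copes with pages that mix relative and absolute links.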
