How to download a file using Python requests, when that file is being served with redirect?

Time:02-22

I'm trying to download a book from Fadedpage, like this one. If you click on the link to the HTML file there, it will display the HTML file. The URL appears to be https://www.fadedpage.com/books/20170817/html.php. But if you try to download that URL by any of the usual means, you only get the metadata HTML, not the HTML with the full text of the book. For instance, running wget https://www.fadedpage.com/books/20170817/html.php from the command line does return HTML, but it's again the metadata HTML file from https://www.fadedpage.com/showbook.php?pid=20170817, not the full text of the book.
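
One way to check whether a redirect is involved is to list the hops that requests followed before it returned. The sketch below is illustrative only: `redirect_chain` is a hypothetical helper name, not part of requests. It assumes it is given a `requests.Response` and uses that object's `history`, `status_code`, and `url` attributes.

```python
def redirect_chain(resp):
    """Return (status_code, url) for each hop, ending with the final response.

    `resp` is expected to be a requests.Response; `resp.history` holds the
    intermediate redirect responses, oldest first.
    """
    hops = [(r.status_code, r.url) for r in resp.history]
    hops.append((resp.status_code, resp.url))
    return hops


# Usage (performs a real network request, so it is left commented out):
# import requests
# resp = requests.get("https://www.fadedpage.com/books/20170817/html.php")
# for status, url in redirect_chain(resp):
#     print(status, url)
```

If the last URL printed is the showbook.php page rather than the html.php page, the request was redirected away from the full text.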

Here's what I've tried so far:

import requests

def downloadFile(bookID, fileType="html"):
    url = f"https://www.fadedpage.com/books/{bookID}/{fileType}.php"
    #url = f'https://www.fadedpage.com/link.php?file={bookID}.{fileType}'
    headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
               "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.3 Chrome/87.0.4280.144 Safari/537.36",
               "referer": f"https://www.fadedpage.com/showbook.php?pid={bookID}",
               "sec-fetch-dest": "document",
               "sec-fetch-mode": "navigate",
               "sec-fetch-site": "same-origin",
               "sec-fetch-user": "?1",
               "upgrade-insecure-requests": "1",
               "cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"
              }
    print("Getting ", url)
    resp = requests.get(url, headers=headers, cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"})
    if resp.ok:
        return resp.text

I'm trying to give it the same headers that my web browser is giving it, in the hopes that it'll return the same thing. But it's not working.

Is there something else I need to do to be able to download this HTML file? Since it's served by PHP on the server side, I'm having a hard time reverse engineering this.

For reference, the full HTML file contains the text "The First Part of this book is intended for pupils so far advanced as to be able to distinguish the Parts of Speech." But that text is not contained in the metadata HTML file.

Testing

Here's another way of testing this:

def isValidDownload(bookID, fileType="html"): 
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "It was a woodland 
    slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting 
    the full text file. 
    """
    with open(f"{bookID}.{fileType}") as f: 
        raw = f.read()
    test = "woodland slope behind St. Pierre-les-Bains"
    return test in raw

This should return True:

downloadFile("20170817", "html")
isValidDownload("20170817", "html")
False

Another attempt

A simpler version, based on the answer below, also doesn't work. Here it is all together:

import requests

def downloadFile(bookID, fileType):
    headers = {"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}
    url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
    print("Getting ", url)
    with requests.get(url, headers=headers) as resp:
        with open(f"{bookID}.{fileType}", 'wb') as f:
            f.write(resp.content)

def isValidDownload(bookID, fileType="html"): 
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "It was a woodland 
    slope behind St. Pierre-les-Bains". If it doesn't, it isn't getting 
    the full text file. 
    """
    with open(f"{bookID}.{fileType}") as f: 
        raw = f.read()
    test = "woodland slope behind St. Pierre-les-Bains"
    return test in raw

downloadFile("20170817", "html")
isValidDownload("20170817", "html")

That returns False.

CodePudding user response:

  1. Pass cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"} instead of headers={"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}.
    This is because the requests library does headers.pop('Cookie', None) upon redirect.
  2. Retry if resp.url is not f"https://www.fadedpage.com/books/{bookID}/{fileType}.php".
    This is because the server's first response to link.php for a different bookID is a redirect to showbook.php.
  3. A download of downloadFile("20170817", "html") contains the text "The First Part of this book is intended for pupils", not "woodland slope behind St. Pierre-les-Bains". That second text is in the download from downloadFile("20130603", "html").

import requests

def downloadFile(bookID, fileType, retry=1):
    # Pass the session cookie via cookies=, not a raw "cookie" header,
    # so that it is still sent after a redirect.
    cookies = {"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"}
    url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
    print("Getting ", url)
    with requests.get(url, cookies=cookies) as resp:
        # After redirects we should land on the book itself; anything else
        # (e.g. showbook.php) means we did not get the full text.
        if resp.url != f"https://www.fadedpage.com/books/{bookID}/{fileType}.php":
            if retry:
                return downloadFile(bookID, fileType, retry=retry-1)
            else:
                raise Exception(f"Unexpected final URL: {resp.url}")
        with open(f"{bookID}.{fileType}", 'wb') as f:
            f.write(resp.content)

def isValidDownload(bookID, fileType="html"):
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "The First Part of
    this book is intended for pupils". If it doesn't, it isn't getting
    the full text file.
    """
    # Known marker text per book. An unknown bookID raises KeyError
    # instead of trivially passing (the empty string is in every file).
    expected = {
        "20130603": "woodland slope behind St. Pierre-les-Bains",
        "20170817": "The First Part of this book is intended for pupils",
    }
    with open(f"{bookID}.{fileType}") as f:
        raw = f.read()
    return expected[bookID] in raw
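
The same retry idea can also be written as a loop around a session object, which sends its cookies on every request, including after redirects and on retries. This is a sketch under assumptions, not a verified recipe for Fadedpage: `fetch_book` and `max_attempts` are hypothetical names, the session is supplied by the caller and is assumed to behave like `requests.Session`, and the PHPSESSID value in the usage comment is the one from the question and will likely need replacing with a current one.

```python
def fetch_book(session, book_id, file_type="html", max_attempts=2):
    """Return the book's bytes, retrying if we land on a different page.

    `session` is expected to behave like requests.Session: cookies set on
    it are sent on every request, including after redirects.
    """
    url = f"https://www.fadedpage.com/link.php?file={book_id}.{file_type}"
    expected = f"https://www.fadedpage.com/books/{book_id}/{file_type}.php"
    last_url = None
    for _ in range(max_attempts):
        resp = session.get(url)
        resp.raise_for_status()
        last_url = resp.url
        # Only accept the response if the redirects ended at the book page.
        if last_url == expected:
            return resp.content
    raise RuntimeError(f"expected to land on {expected}, got {last_url}")


# Usage (real network access, so it is left commented out):
# import requests
# with requests.Session() as session:
#     session.cookies.set("PHPSESSID", "3r7ql7poiparp92ia7ltv8nai5")
#     data = fetch_book(session, "20170817")
#     with open("20170817.html", "wb") as f:
#         f.write(data)
```

Returning the bytes instead of writing the file inside the function keeps the network step separate from the file check in isValidDownload, and the error message reports where the request actually ended up.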