Downloading pdf files from php server || saving not available files

Time:12-31

I am trying to download PDFs (a few, very rarely, may be Word files) hosted on a PHP server. The files appear to be numbered sequentially from 1 to 14000, and each one can be downloaded from the link http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=X, where X is a number in the [1, 14000] range. My plan is to loop over every X in that range and save all the files in a specific folder. Currently, when no PDF exists for a given X, the code creates a PDF file of zero bytes. The code below runs a test on the X values from 13980 to 14000, for which no PDFs exist.

import requests

urls = [('13980', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13980'),
        ('13981', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13981'),
        ('13982', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13982'),  
        ('13983', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13983'), 
        ('13984', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13984'), 
        ('13985', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13985'), 
        ('13986', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13986'), 
        ('13987', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13987'),
        ('13988', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13988'),
        ('13989', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13989'), 
        ('13990', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13990'), 
        ('13991', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13991'), 
        ('13992', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13992'), 
        ('13993', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13993'), 
        ('13994', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13994'), 
        ('13995', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13995'), 
        ('13996', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13996'), 
        ('13997', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13997'), 
        ('13998', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13998'), 
        ('13999', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13999'), 
        ('14000', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=14000')]

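# Reuse one session for all requests.
s = requests.Session()

for number, url in urls:
    response = s.get(url)

    # The with block closes the file on exit, so no explicit close() is needed.
    with open("/Users/aartimalik/Downloads/test/" + number + "_phptest.pdf", "wb") as f:
        f.write(response.content)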

This code saves 0-byte PDFs because no PDFs exist for those numbers. I want it to save a .pdf file only if a PDF exists for the given X, and to report "no pdf file" if it doesn't. I'm not sure whether that is possible using with open. Any help is appreciated. Thanks!

CodePudding user response:

The following worked for the Word files (it can be modified to handle PDFs as well):

import requests
import os

os.chdir("/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs")

from phpurldoc import urls

print(urls)

# Reuse one session for all requests.
s = requests.Session()

for number, url in urls:
    response = s.get(url)
    # Take the text after the last "=" in the header, i.e. the file name.
    h = response.headers["Content-Disposition"].split("=")[-1]

    # A name ending in "x" is treated as .docx; anything else as .doc.
    extension = ".docx" if h[-1] == "x" else ".doc"
    with open("./bidsummaries-doc/" + h + "_" + number + extension, "wb") as f:
        f.write(response.content)
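
For the original goal (save a file only when one exists, otherwise report "no pdf file"), here is a rough sketch of one way to do it. It rests on assumptions I have not verified against this server: that a missing id comes back with an empty body, and that an existing file carries a Content-Disposition header with a filename. The names build_urls, target_filename and download_all are made up for this example, and download_all expects to be handed a requests.Session (or anything with a compatible get method).

```python
import os
from typing import List, Mapping, Optional, Tuple

# URL pattern taken from the question; everything else here is illustrative.
BASE_URL = "http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id={}"


def build_urls(first: int, last: int) -> List[Tuple[str, str]]:
    """Build (number, url) pairs for the ids first..last, inclusive."""
    return [(str(n), BASE_URL.format(n)) for n in range(first, last + 1)]


def target_filename(number: str, headers: Mapping[str, str],
                    content: bytes) -> Optional[str]:
    """Return a file name for this response, or None if there is nothing to save."""
    # Assumption: a missing document is served as an empty body.
    if not content:
        return None
    disposition = headers.get("Content-Disposition", "")
    if "filename=" in disposition:
        # Keep only the filename parameter and drop surrounding quotes.
        name = disposition.split("filename=")[-1].split(";")[0].strip().strip('"')
        ext = os.path.splitext(name)[1].lower()
        if ext:
            return number + ext
    # Assumption: with no usable header, fall back on the PDF magic bytes.
    if content.startswith(b"%PDF"):
        return number + ".pdf"
    return None


def download_all(session, urls, out_dir):
    """Save every existing file into out_dir; return the ids that had no file."""
    missing = []
    for number, url in urls:
        response = session.get(url, timeout=30)
        name = target_filename(number, response.headers, response.content)
        if name is None:
            print(number + ": no pdf file")
            missing.append(number)
            continue
        # Open the file only after deciding there is something to write,
        # so no zero-byte files are created.
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(response.content)
    return missing
```

It could be called as download_all(requests.Session(), build_urls(1, 14000), "/path/to/folder") after import requests. The key point is that the decision happens before open is called, since opening a file in "wb" mode creates it even if nothing is written. If the server signals a missing file differently (for example with an HTML error page or a 404 status), the check in target_filename would need to be adjusted.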