I am trying to download the PDFs (very rarely, a file is a Word document instead) hosted on a PHP server. The files appear to be numbered sequentially from 1 to 14000, and each one can be downloaded from http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=X, where X is a number in the range [1, 14000]. My plan is to loop over all values of X and save every file to a specific folder. The problem is that when no PDF exists for a given X, my code still creates a zero-byte .pdf file. The code below runs a test on the X values 13980 to 14000, for which no PDFs exist.
import requests
urls = [('13980', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13980'),
('13981', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13981'),
('13982', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13982'),
('13983', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13983'),
('13984', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13984'),
('13985', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13985'),
('13986', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13986'),
('13987', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13987'),
('13988', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13988'),
('13989', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13989'),
('13990', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13990'),
('13991', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13991'),
('13992', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13992'),
('13993', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13993'),
('13994', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13994'),
('13995', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13995'),
('13996', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13996'),
('13997', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13997'),
('13998', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13998'),
('13999', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=13999'),
('14000', 'http://ppmoe.dot.ca.gov/des/oe/awards/bidsum/dl.php?id=14000')]
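# Reuse one session for all requests instead of creating a new one per URL
s = requests.Session()
for number, url in urls:
    response = s.get(url)
    # The "with" block closes the file automatically, so f.close() is not needed
    with open("/Users/aartimalik/Downloads/test/" + number + "_phptest.pdf", "wb") as f:
        f.write(response.content)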
This code saves 0-byte PDFs because no PDFs exist for those numbers. I want it to save a .pdf file only if a file exists for the given X, and to report "no pdf file" otherwise. I'm not sure whether this can be done with the "with open" statement. Any help is appreciated. Thanks!
CodePudding user response:
The following worked for me. It saves the Word files (.doc and .docx), and it can be modified to handle PDFs as well:
import os
import requests

os.chdir("/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs")
# phpurldoc appears to be a local module that provides the (number, url) list
from phpurldoc import urls

print(urls)
s = requests.Session()
for number, url in urls:
    response = s.get(url)
    # File name reported by the server, taken from the Content-Disposition header
    h = response.headers["Content-Disposition"].split("=")[-1]
    # A name ending in "x" is saved as .docx; anything else is saved as .doc
    ext = ".docx" if h[-1] == "x" else ".doc"
    # The "with" block closes the file automatically, so f.close() is not needed
    with open("./bidsummaries-doc/" + h + "_" + number + ext, "wb") as f:
        f.write(response.content)
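To cover the PDF case from the question as well, here is a minimal sketch of one way to skip ids that have no file. It rests on two assumptions: that the server returns an empty body (or an error status) for a missing id, which is what the 0-byte files in the question suggest, and that the Content-Disposition header, when present, carries the file name. PDF files start with the bytes %PDF, so that check does not depend on the header. The helper names pick_extension, save_if_present and download_all are made up for this example.

```python
from pathlib import Path
from typing import Mapping, Optional


def pick_extension(content: bytes, headers: Mapping[str, str]) -> Optional[str]:
    """Return the extension to save under, or None if there is nothing to save."""
    if not content:
        return None  # assumption: a missing id yields an empty body
    if content.startswith(b"%PDF"):
        return ".pdf"  # PDF files begin with the magic bytes %PDF
    # Otherwise fall back to the file name in the Content-Disposition header
    disposition = headers.get("Content-Disposition", "")
    name = disposition.split("=")[-1].strip('"; ')
    suffix = Path(name).suffix.lower()
    return suffix if suffix in {".pdf", ".doc", ".docx"} else None


def save_if_present(number: str, content: bytes, headers: Mapping[str, str],
                    out_dir) -> Optional[Path]:
    """Write the file only when a document was returned; otherwise report it."""
    ext = pick_extension(content, headers)
    if ext is None:
        print(f"{number}: no pdf file")
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{number}_phptest{ext}"
    target.write_bytes(content)
    return target


def download_all(urls, out_dir):
    """Loop over (number, url) pairs and save whatever documents exist."""
    import requests  # third-party; imported here so the helpers above work without it

    with requests.Session() as s:
        for number, url in urls:
            response = s.get(url, timeout=30)
            # Treat HTTP error statuses the same way as an empty body
            content = response.content if response.ok else b""
            save_if_present(number, content, response.headers, out_dir)
```

With the urls list from the question, calling download_all(urls, "/Users/aartimalik/Downloads/test") should print "no pdf file" for each missing id instead of creating an empty file. The file is opened only after the check, so nothing is written for empty responses.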