Downloading PDFs behind HTTPS with requests/BeautifulSoup won't work


I am trying to accomplish the following:

- Find all .PDF files on a webpage that requires a login
- Rename the .PDF files so they carry only the file name, not the full URL
- Create a folder on the local user's desktop
- Only download files that aren't already present in the created folder
- Download the .PDF files to the new folder

The code below logs into the website, retrieves all the .PDF links, slices each link down to just the file name, and downloads the files to the folder. However, all of the downloaded files seem to be corrupt (they can't be opened).

Any feedback or recommendations on how to fix it would be appreciated. (The payload has been altered so as not to give any credentials away.)


Additional information:


Sampleurl is the main page of the website after logging in. loginurl is the page where users get authenticated. secure_url is the page containing all the .PDFs that I want to download.



Code:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request

# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')



payload = {
    'username': 'xxxx',
    'password': 'xxx',
    'ltfejs': 'xx'
    
}



  
with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))


    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)


    
    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for url in url_list:
        print(url)
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        print(fullfilename)
        request.urlretrieve(Sampleurl,fullfilename)

     
            
print("This program will automatically close in 5 seconds ")
time.sleep(5)

Output

Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\quickscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\fullscan.pdf
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\improvementscan.pdf
https://www.tict.io/downloads/privacylabel.pdf
C:\Users\MATH\desktop\Vodafone_Invoices\privacylabel.pdf
This program will automatically close in 5 seconds 

A working .PDF does download when I manually click on one of the hyperlinks in the output.
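
For anyone hitting the same symptom: the likely reason the saved files are corrupt is that request.urlretrieve(Sampleurl, fullfilename) fetches Sampleurl (the home page) over a fresh connection that carries none of the login cookies, so every saved file is really an HTML page. Below is a minimal sketch of downloading through the authenticated session instead; the streaming pattern is standard requests usage, and the variable names are taken from the code above.

# Sketch: fetch each PDF through the logged-in session instead of urlretrieve
for url in url_list:
    fullfilename = os.path.join(folder_location, url.split("/")[-1])
    os.makedirs(folder_location, exist_ok=True)
    with s.get(url, stream=True) as resp:   # reuses the cookies set by the login POST
        resp.raise_for_status()             # fail loudly instead of silently saving an error page
        with open(fullfilename, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)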


EDIT

I've adjusted my code and now it does download a working PDF to the allocated folder; however, it only takes the last file in the list and won't repeat the cycle for the others.

    print("Downloading .PDF files")
        
    # download the pdfs to a specified location
    for PDF in url_list:
        fullfilename = os.path.join(r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username), url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        myfile = requests.get(PDF) 
        open(fullfilename, 'wb').write(myfile.content)
        

print("This program will automaticly close in 5 seconds ")
time.sleep(5)

Only privacylabel.pdf (the last file in url_list) gets downloaded. The others won't appear in the folder. When printing PDF, it also only shows privacylabel.pdf.
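
One likely contributor, judging from the snippet: the loop iterates over PDF but builds fullfilename from a leftover url variable, so the target path never changes between iterations and each download overwrites the previous file. A minimal sketch of just the naming fix, using the variables from the snippet above (the missing session is covered in the answer below):

for PDF in url_list:
    # derive the file name from the loop variable, not from a leftover `url`
    fullfilename = os.path.join(folder_location, PDF.split("/")[-1])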

CodePudding user response:


Working

I forgot to make the request through the session s:

myfile = requests.get(PDF)

should have been

myfile = s.get(PDF)
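
The session matters because it carries the cookies set by the login POST, which the protected /tool/ URLs require; a plain requests.get() starts an anonymous connection and gets the login page instead. A quick way to confirm the cookies are there after logging in (a hypothetical check, not part of the original answer):

print(s.cookies.get_dict())   # should list the session/auth cookie set by the login POST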

Working code for those interested:

# Import libraries
import requests
from bs4 import BeautifulSoup
import os
from pprint import pprint
import time
import re
from urllib import request
from urllib.parse import urljoin
import urllib.request


# Fetch username
username = os.getlogin()    

# Set folder location to local users desktop
folder_location = r'C:\Users\{0}\desktop\Vodafone_Invoices'.format(username)

Sampleurl = ('https://www.tict.io')
loginurl =('https://www.tict.io/auth/login')
secure_url = ('https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca')


    

Username = input("Username: ")
Password = input("Password: ")

payload = {
    'username': (Username),
    'password': (Password),
    'ltfejs': 'xxx'
    
}

  
with requests.session() as s:
    print("Connecting to website")
    s.post(loginurl, data=payload)
    r = s.get(secure_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'(.pdf)'))

    print("Gathering .PDF files")
    # clean the pdf link names
    url_list = []
    for el in links:
        if(el['href'].startswith('https:')):
            url_list.append(el['href'])
        else:
            url_list.append(Sampleurl + el['href'])
    
    pprint(url_list)

   
    
    print("Downloading .PDF files")
    
# download the pdfs to a specified location
    for url in url_list:
        fullfilename = os.path.join(folder_location, url.split("/")[-1])
        if not os.path.exists(folder_location):os.mkdir(folder_location)    
        myfile = s.get(url)
        print(url)
        print("Myfile response:",myfile)
        open(fullfilename, 'wb').write(myfile.content)
                

print("This program will automaticly close in 5 seconds ")
time.sleep(5)

Output

Username: xxxx
Password: xxxx
Connecting to website
Gathering .PDF files
['https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf',
 'https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf',
 'https://www.tict.io/downloads/privacylabel.pdf']
Downloading .PDF files
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/quickscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/fullscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/tool/87dd1218-f632-4ddb-b4d2-1f195bb4a5ca/improvementscan.pdf
Myfile response: <Response [200]>
https://www.tict.io/downloads/privacylabel.pdf
Myfile response: <Response [200]>
This program will automatically close in 5 seconds 

Conclusion

  1. I had to make the requests through the session s; because I forgot to do that, the protected files could not be reached.
  2. I had to alter the download code a bit, since the original tried downloading with urlretrieve instead of requests (a quick way to verify the downloads is sketched below).
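
As a final sanity check (not part of the original post), here is a small sketch of how one might confirm that a response really is a PDF before writing it to disk; the %PDF magic-byte and Content-Type checks are standard, and the variable names are assumed from the working code above.

resp = s.get(url)
content_type = resp.headers.get('Content-Type', '')
looks_like_pdf = resp.content.startswith(b'%PDF') or 'pdf' in content_type.lower()
if looks_like_pdf:
    with open(fullfilename, 'wb') as f:
        f.write(resp.content)
else:
    print("Skipping", url, "- response does not look like a PDF:", content_type)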