How to download a PDF from a URL in Python


Note: This is a very different problem from the other SO answers (e.g. Selenium Webdriver: How to Download a PDF File with Python?) available for similar questions.

This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not directly return the PDF; the page in turn makes several other calls, one of which is the URL that returns the PDF file.

I want to be able to call the URL with a variable for the query param TRACKNO and save the PDF file using Python.
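For reference, fetching the page URL directly with requests just returns the HTML page rather than the PDF, which is easy to confirm by checking the Content-Type header (a quick sketch; the exact header value returned by the server is an assumption):

import requests

# Fetching the page directly returns HTML, not the PDF, because the PDF
# is fetched by a secondary request that the page itself triggers.
response = requests.get(
    'https://webice.ongc.co.in/pay_adv',
    params={'TRACKNO': '8262'},
    verify=False,
)
print(response.headers.get('Content-Type'))  # HTML here, not application/pdf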

I was able to do this using Selenium, but my code fails when the browser runs in headless mode, and I need it to work headless. The code I wrote is as follows:

import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    # Read the browser's resource timing entries, which list every network
    # request the page has made so far.
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        # The PDF is loaded by an <object> element, so look for a resource
        # entry initiated by "object" whose URL ends with the download marker.
        if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
            link_split = request['name'].split('-')
            if(link_split[-1] == 'filedownload=X'):
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url,tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url,verify=False)

    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
#   options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    
    for num in tracking_nums:
        print(num,end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url,num)
        except:
            # Swallow failures (timeout, missing URL) for this tracking
            # number and move on to the next one.
            pass
    driver.quit()

get_payment_advice(['8262'])

As can be seen, I get all the network calls the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.

Is there any other way of doing this, as this seems like a workaround? If not, can this be fixed to work in headless mode?
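One alternative that might be worth trying (a sketch, not verified against this site, assuming Selenium 4 with Chrome) is to capture network requests through Chrome's performance log instead of the in-page Performance API:

import json
import time

from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to record performance (network) events alongside the session.
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=options)

driver.get('https://webice.ongc.co.in/pay_adv?TRACKNO=8262#')
time.sleep(2)  # give the secondary PDF request time to fire

# Each log entry wraps a DevTools event; completed responses appear as
# Network.responseReceived events that carry the requested URL.
for entry in driver.get_log('performance'):
    event = json.loads(entry['message'])['message']
    if event.get('method') == 'Network.responseReceived':
        url = event['params']['response']['url']
        if 'filedownload=X' in url:  # same marker the code above looks for
            print(url)

driver.quit()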

CodePudding user response:

I fixed it; I only changed one function. The correct URL is in the driver's page_source (with BeautifulSoup you can parse HTML, XML, etc.):

from bs4 import BeautifulSoup

def extract_url(driver):
    # The PDF is embedded via an <object> element whose "data" attribute
    # holds the download URL (as a path relative to the host).
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")

    return f"https://webice.ongc.co.in{data}"

The hostname part could perhaps be extracted from the driver; see the sketch below. I don't think I changed anything else, but if it doesn't work for you, I can paste the full code.
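For instance, one way to avoid hard-coding the host (a sketch, assuming the driver is still on the page) is to resolve the object's data attribute against driver.current_url:

from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    # Resolve a relative path like "/pay_adv-..." against the page the
    # driver is currently on, instead of hard-coding the hostname.
    return urljoin(driver.current_url, data)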

Old Answer:

If you print the text of the returned page (print(driver.page_source)), I think you will get a message that says something like: "Because of your system configuration the pdf can't be loaded".

This is because the requested site checks some properties to decide whether or not you are a robot. Maybe it helps to change some arguments (screen size, user agent) to fix this; there is some information available about how sites detect a headless browser.
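For example, a minimal sketch of such arguments (the exact combination that satisfies this site is an assumption, not something verified here):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')           # Chrome's newer headless mode
options.add_argument('--window-size=1920,1080')  # headless default window is small
# Replace the default user agent: headless Chrome advertises itself
# as "HeadlessChrome", which sites can check for.
options.add_argument(
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)
driver = webdriver.Chrome(options=options)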

And next time, you should paste all relevant code (including the imports) into the question to make it easier to test.
