I have a huge list of URLs and each one loads a different PDF document. This is one of them: https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0
On the first try it will most likely open the website's home page, but if you paste the link again it opens a PDF document.
I'm trying to write a Python script to download those documents locally and extract their content using tika, but the home page appearing on the first request is throwing a wrench in everything I try.
1. I tried requests, but as expected it just returns the HTML content of the home page:
import requests
from tika import parser

link = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
resp = requests.get(link)
# Save whatever the server returned (here it turns out to be HTML, not a PDF)
with open('metadata.pdf', 'wb') as f:
    f.write(resp.content)
raw = parser.from_file('metadata.pdf', xmlContent=False)
print(raw['content'])
output:
\n\n\n\n\n\n\n\n\n\n \n \t\t\n\n\t\tSkip to Main Content\xa0\xa0\xa0\xa0Logout\xa0\xa0\xa0\xa0My
Account\xa0\xa0\xa0\xa0\t\t\tHelp\n\n\n\n\n\n\n\t\t\t\nSelect a location\nPinellas County\n\n\xa0\nAll Case
Records Search\nCivil, Family Case Records\nCriminal & Traffic Case Records\nProbate Case Records\nCourt
Calendar\n\nAttorney Login\nRegistered User Login\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\n\n\t\t\
t\xa0\t\n\t\n\t\tClerk of the Circuit Court|Mortgage Foreclosure Sales|Pinellas County Government|Pinellas
County Sheriff's Office|Public Defender|Sixth Judicial Circuit|State of Florida|State Attorney|Self Help
Center|Court Forms|How-To Videos|Florida Courts eFiling Portal Video|Attorney Account Setup|Reports and
Statistics|Terms of Use|Contact UsCopyright 2003 Tyler Technologies. All rights Reserved.\n\t\n\n\n\n
\n
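A quick way to confirm what came back before handing the file to tika is to look at the first bytes of the response, since a real PDF begins with the marker `%PDF-`. The sketch below is only a diagnostic; the helper name `looks_like_pdf` and the sample byte strings are made up for illustration and are not part of the original script.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload appears to start with a PDF header."""
    # PDF files begin with "%PDF-"; search the first 1024 bytes instead of
    # only offset 0 in case the server prepends whitespace or a BOM.
    return b"%PDF-" in data[:1024]


# Hypothetical sample payloads standing in for resp.content
html_page = b"<!DOCTYPE html><html><body>Skip to Main Content</body></html>"
pdf_header = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

print(looks_like_pdf(html_page))   # False
print(looks_like_pdf(pdf_header))  # True
```

In the script above this would be called as `looks_like_pdf(resp.content)` before writing `metadata.pdf`.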
2. I tried opening the home page with Selenium and transferring the cookies from the webdriver to requests, following this answer.
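import requests
from selenium import webdriver

# Any Selenium WebDriver should work here; Chrome is only an assumption
driver = webdriver.Chrome()

url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
cookies = driver.get_cookies()
# Copy the browser's cookies into a requests session
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
resp = s.get(url)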
It did not work, and when I checked the CookieJar of the response object it was empty. I admit I understand very little about how cookies work, so this was a desperate attempt. What am I misunderstanding here? I appreciate any input.
3. My last resort (for obvious reasons) was to open each document via webdriver and download the content, but even this did not work.
# Opens a new window and makes it the working window
def open_window(driver, link):
    driver.execute_script(f"window.open('{link}')")
    new_window = driver.window_handles[-1]
    driver.switch_to.window(new_window)

url = "https://ccmspa.pinellascounty.org/PublicAccess/ViewDocumentFragment.aspx?DocumentFragmentID=74223655&CheckDocumentGroups=0"
driver.get(url)
open_window(driver, url)
# Print the source of the new window
print(driver.page_source)
The output is just this:
<html><head></head><body></body></html>
CodePudding user response:
After a little more tinkering, solution #2 worked. Instead of taking the cookies from the driver right after loading the main page, I had the browser run another query first (with a few extra steps specific to this website) and only then copied the cookies. They look like this:
[{'domain': 'ccmspa.pinellascounty.org',
'expiry': 1670679832, #this is the time the cookie expires in epoch time
'httpOnly': True,
'name': '.ASPXFORMSPUBLICACCESS',
'path': '/',
'secure': True,
'value': '1DBB1EADBA199D246E84CCE7243202DCA6BBD7E383FE360ECBFC2E6150102C79F3EC2F6B232B85589C51976AF20EF7EBDF52CF74122A7A6E78B4C6F31434C58AB57E10005C41DE019814B704F12B150A0818585E85F0237EFCF1A11B205414325CA1850605FF932BC43CC5B36395488F40D58DA594899C4D62FF3ECCBE729C6BC001194225B6653CB89C1305C7FBCB26E1BCFCFF75476784D24ADFCA0AFF679A3BAA3131'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': True,
'name': 'ASP.NET_SessionId',
'path': '/',
'secure': True,
'value': '24552pqtb1tomjbw2gkzko55'},
{'domain': 'ccmspa.pinellascounty.org',
'httpOnly': False,
'name': 'EDLFDCVM',
'path': '/',
'sameSite': 'None',
'secure': True,
'value': '02282de498-9595-48s0hGpl59SkUKRZpRrS_b1TKJfXlz_3dGN9xGZ2tcTXrHuDsR5rN90I_Rp192pX48C1k'}]
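
For reference, here is a minimal sketch of how a cookie list like the one above might be copied into a requests session. It assumes `driver` has already performed the site-specific query described earlier (those steps are not shown). The helper name `session_from_selenium_cookies` is hypothetical, and passing the domain and path along with each cookie is an addition of this sketch rather than something the original code did.

```python
import requests


def session_from_selenium_cookies(selenium_cookies):
    """Build a requests.Session from the dicts returned by driver.get_cookies()."""
    session = requests.Session()
    for cookie in selenium_cookies:
        # Keep domain, path and the secure flag so the cookie is scoped
        # the same way it was in the browser.
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
            secure=cookie.get("secure", False),
        )
    return session


# Usage, assuming `driver` has already run a query on the site:
#   s = session_from_selenium_cookies(driver.get_cookies())
#   resp = s.get(url)
#   with open("document.pdf", "wb") as f:
#       f.write(resp.content)
```

Since the `.ASPXFORMSPUBLICACCESS` cookie carries an expiry, a long batch of downloads would presumably need the cookies refreshed from the browser once they lapse.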