Python crawl PDF is empty file

Time: 05-26

The site requires registration. I have already registered, and the login cookies are sent with the request, but the "PDF" I download is only a 543-byte file that cannot be opened. Could someone please help?
 
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Author: xiaofeng
Date: 25/5/2021 8:23 AM
Python version: 3.7.4
"""
import sys
sys.path.append("./")
import time
import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()
from utils.mysql_database import MySQL


cookies = {'ASP.NET_SessionId': 'ltwutnzgoksolnh5plvbgom5', '__jsluid_h': '0dcd428869d5ddfa9f93734877768ebd', 'ADHOC_MEMBERSHIP_CLIENT_ID1.0': '1bdb8472-8581-4d57-1f15-6886e481c94e', 'hxck_cd_sourceteacher': 'sR/uPcnSSZVIdShwHag3RAnrY9aauRbMjEnRBtq/NF1ooDP7obDVPgaQGWxsj76JKxyO3FI4Vc44fDgs6a1RPPpEobT+rbs69gkUhZXkQRoriNmMhXYGi/Kg9Ge0W8OrEI7l3vb8+Lqxxc3S7ShzRhKqh6bc78GcuOlpZb32nz0skyJDPkyAQqaKNc5MYiP97iqqVoya6Js', 'hxck_cd_channel_order_mark1': '4001000001_auto', 'hxck_cd_channel': 'tKK6EMkJ7JK75WOJ/qluxbbMrhZQZtn9if6/Tggkwv0C9bkEdyIEtGS3AuATPaYjdIt/JM0butmcm0ugDRAOQLZ9K84yt3NDkdLWbt3B6UKb6vH2CyPMpRzCakALWJ0ET22e0EDh5Fg', 'HexunTrack': 'SID', 'UM_distinctid': '179a0ef15f934a-095394371ec4c1-37607201-1fa400-179a0ef15faba8', 'CNZZDATA1261865322': '1800598666-1621899014-|1621899014', 'Hm_lvt_cb1b8b99a89c43761f616e8565c9107f': '1621902694', 'hxck_fsd_lcksso': '765785F554A89C312D8941DA89784E8DA6F07A4DD0C8A7A01756B75532CBC92A9AF61DA2A10903428FD9C7FC05571766C495F9C6E6533A8E8A3774708CE623A80C4187A3EF72C130918420129BC6A79A375822E3B617E696711F30060E06725F0C072AC7F0B58AD6C801EDC29EFE6355DD132D6C15AD991C443F101A5098B016DB8260A7CC55FD1BA2761CB14AE42A503C7B409ADA0E30796F444D8FDFCAA827CBE7B840C02B95B605E8BCA36760606A', 'appToken': 'pc,other,chrome,hxAppSignId69134280520491461621903400822,PCDUAN', 'Hm_lvt_81ff19c9eb1c05cdfeacb05d2036f066': '1621903401', 'Hm_lpvt_81ff19c9eb1c05cdfeacb05d2036f066': '1621903401', 'cn_1263247791_dplus': '{"distinct_id": "179a0ef15f934a-095394371ec4c1-37607201-1fa400-179a0ef15faba8","userFirstDate": "20210525","userID": "32292139","userName": "wxmi92r4bx","userType": "loginuser","userLoginDate": "20210525","$_sessionid": 0,"$_sessionTime": 1621903815,"$dp": 0,"$_sessionPVTime": 1621903815}', 'userToken': '32292139|0000|0,dG5BpY0afu0XRdGr7cu5AGxVNVoAoU8uZn4D7HR9oE1VDxJlPLV8ZpSB3QtLylj5AuQOUmUm8SVmFm4j4T042k/BD9a/HRLPKNbXRpWJJ8I7ycZWKhCuPuwxEPmr+YMNy1T9WDYekIcDRnZyvQOGmOv2hwOJo8+4xJawWZl5+PdmIUHsA2NwuPahwH9JbfSEB7XcDuZCTttz/qQTmpZLUQ', 'hxck_sq_common': 'SnapCookie', 'Hm_lpvt_cb1b8b99a89c43761f616e8565c9107f': '1621903840'}


mysql = MySQL("test_db")
sql_list = list()


def down_load_pdf(url, name, headers_pdf):
    # The query string after "?" carries the verifycode and fn parameters.
    data = url.split("?")[1]
    post_data = dict()
    verifycode = data.split("&")[0].split("=")[1]
    fn = data.split("&")[1][3:]
    post_data["verifycode"] = verifycode
    post_data["fn"] = fn
    response = requests.get(url=url, headers=headers_pdf, data=post_data, cookies=cookies)
    with open("download_pdf/company_research/{}.pdf".format(name), "wb") as f:
        f.write(response.content)


def main():
    url_info = ("http://yanbao.stock.hexun.com/dzgg1022315.shtml",
                "http://yanbao.stock.hexun.com/dzgg1022288.shtml",
                "http://yanbao.stock.hexun.com/dzgg1022286.shtml",
                "http://yanbao.stock.hexun.com/dzgg1022276.shtml",
                "http://yanbao.stock.hexun.com/dzgg1022267.shtml",
                "http://yanbao.stock.hexun.com/dzgg1023589.shtml")
    for url in url_info:
        url_page = 1
        source_page = "http://yanbao.stock.hexun.com/listnews1_" + str(url_page) + ".shtml"
        headers_pdf = {
            "Host": "yanbao.stock.hexun.com",
            "Referer": source_page,
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
        }
        pdf_name = url.replace("http://yanbao.stock.hexun.com/", "").replace(".shtml", "")
        response = requests.get(url=url, headers=headers_pdf, cookies=cookies)
        pdf_html = response.content.decode("gbk")
        soup_ = BeautifulSoup(pdf_html, "html.parser")
        for p_tag in soup_.find_all("a", class_="check-pdf"):
            pdf_url = p_tag.attrs["href"]  # the href attribute carries the PDF url
            if "verifycode" in pdf_url and "fn" in pdf_url:
                print(pdf_url)
                down_load_pdf(pdf_url, pdf_name, headers_pdf)
            else:
                print("PDF url fetch failed: " + pdf_url)
            time.sleep(3)


if __name__ == "__main__":
    main()
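A 543-byte result usually means the server sent back a small HTML error or login page rather than the PDF itself. Before writing the body to disk, it is worth checking the PDF magic bytes; a minimal stdlib-only sketch (the helper names are my own, not part of the script above):

```python
def looks_like_pdf(content: bytes) -> bool:
    # Every real PDF file begins with the magic bytes "%PDF".
    return content.startswith(b"%PDF")


def diagnose_download(content: bytes) -> str:
    # Summarize a downloaded body instead of writing it to disk blindly.
    if looks_like_pdf(content):
        return "ok: {} bytes".format(len(content))
    # A tiny non-PDF body is typically an HTML login or error page;
    # its first bytes usually reveal what the server actually returned.
    return "not a PDF: {} bytes, starts with {!r}".format(len(content), content[:40])
```

Printing `diagnose_download(response.content)` inside `down_load_pdf` would show whether those 543 bytes are an HTML error page, which would point to the cookies or headers rather than the download code itself.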


CodePudding user response:

I crawled http://yanbao.stock.hexun.com/dzgg1042410.shtml and downloaded the PDF from that page directly; you just need to change the cookies to your own.
import requests
from lxml import etree


cookie = 'UM_distinctid a54f0b10d=17887-06 b524a5fabe19 a54f0c1ea fa400-17887-5771031-1; ...'  # (cookie value truncated in the original post)
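As an aside, the asker's `down_load_pdf` extracts `verifycode` and `fn` by splitting the query string by hand, which breaks if the parameter order changes or the values are URL-encoded. The standard library's `urllib.parse` handles this more robustly; a small sketch (the example URL and parameter values are made up):

```python
from urllib.parse import urlsplit, parse_qs


def pdf_params(pdf_url):
    # parse_qs maps each query parameter to a list of values;
    # take the first value for every key.
    query = parse_qs(urlsplit(pdf_url).query)
    return {key: values[0] for key, values in query.items()}


params = pdf_params("http://yanbao.stock.hexun.com/pdf?verifycode=abc123&fn=report.pdf")
# params == {'verifycode': 'abc123', 'fn': 'report.pdf'}
```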