Help needed: fixing Python crawler code

Time: 04-17

This is a crawler for a newspaper site. I'm a beginner and my skills are limited; after tinkering with it for several days I still can't get the article content. Could someone please help me fix the code? Thank you very much.

import requests
import bs4
import os
import datetime
import time
def fetchUrl(url):
    '''
    Function: fetch the page at the given url and return its content
    Parameters: the target page url
    Returns: the HTML content of the target page
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text
def getPageList(year, month, day):
    '''
    Function: get the list of page (layout) links for the given day's paper
    Parameters: year, month, day
    '''
    url = 'https://www.shobserver.com/staticsg/res/html/journal/index.html?date=' + year + '-' + month + '-' + day + '&page=01'
    html = fetchUrl(url)
    bsobj = bs4.BeautifulSoup(html, 'html.parser')
    pageList = bsobj.find('div', attrs={'class': 'dd-box'}).find_all('dd')
    linkList = []
    for page in pageList:
        tempList = page.find_all('a')
        for temp in tempList:
            link = temp['href']
            if 'index.html' in link:
                url = 'https://www.shobserver.com/staticsg/res/html/journal/' + link
                linkList.append(url)
    return linkList
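As a sanity check, the index URL that getPageList assembles can be tested without any network access. This sketch only exercises the string concatenation; the helper name build_index_url is mine, for illustration:

```python
def build_index_url(year, month, day):
    # Assemble the journal index URL for a given date, using the same
    # pattern as getPageList above (hypothetical helper, for illustration).
    base = 'https://www.shobserver.com/staticsg/res/html/journal/index.html'
    return base + '?date=' + year + '-' + month + '-' + day + '&page=01'

print(build_index_url('2021', '09', '16'))
# → https://www.shobserver.com/staticsg/res/html/journal/index.html?date=2021-09-16&page=01
```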
def getTitleList(year, month, day, pageUrl):
    '''
    Function: get the list of article links on one page of the paper
    Parameters: year, month, day, the link of the page (layout)
    '''
    html = fetchUrl(pageUrl)
    bsobj = bs4.BeautifulSoup(html, 'html.parser')
    titleList = bsobj.find('div', attrs={'class': 'dd-box news-list'}).find_all('dd')
    linkList = []
    for title in titleList:
        tempList = title.find_all('a')
        for temp in tempList:
            link = temp['href']
            if 'detail.html' in link:
                url = 'https://www.shobserver.com/staticsg/res/html/journal/' + link
                linkList.append(url)
    return linkList
def getContent(html):
    '''
    Function: parse the HTML page and extract the news article content
    Parameters: the HTML content of the page
    '''
    bsobj = bs4.BeautifulSoup(html, 'html.parser')
    # get the title
    title = bsobj.find_all('div', attrs={'class': 'con-title'})
    content1 = ''
    for p1 in title:
        content1 += p1.text + '\n'
    # print(content1)
    # get the article body
    pList = bsobj.find_all('div', attrs={'class': 'txt-box'})
    content = ''
    for p in pList:
        content += p.text + '\n'
    # print(content)
    # return title + body
    resp = content1 + content
    return resp
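The extraction in getContent can be checked offline against a tiny HTML sample in the same shape the parser expects (a 'con-title' div for the headline, a 'txt-box' div for the body). The sample markup below is mine, for illustration:

```python
import bs4

# Minimal sample matching the class names the crawler looks for.
sample = '''
<div class="con-title"><h2>Sample headline</h2></div>
<div class="txt-box"><p>First paragraph.</p><p>Second paragraph.</p></div>
'''

bsobj = bs4.BeautifulSoup(sample, 'html.parser')
title = ''.join(d.text for d in bsobj.find_all('div', attrs={'class': 'con-title'}))
body = ''.join(d.text for d in bsobj.find_all('div', attrs={'class': 'txt-box'}))
print(title)  # → Sample headline
print(body)   # → First paragraph.Second paragraph.
```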
def saveFile(content, path, filename):
    '''
    Function: save text content to a local file
    Parameters: the content to save, the path and the file name
    '''
    # if the folder does not exist, create it automatically
    if not os.path.exists(path):
        os.makedirs(path)
    # save the file
    with open(path + filename, 'w', encoding='utf-8') as f:
        f.write(content)
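A slightly more robust variant of saveFile avoids the race between the existence check and the directory creation, and joins the path with os.path.join instead of raw concatenation. This is a sketch (function name save_file is mine), verified here against a temporary directory:

```python
import os
import tempfile

def save_file(content, path, filename):
    # exist_ok=True makes the call safe even if the folder already exists.
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, filename), 'w', encoding='utf-8') as f:
        f.write(content)

tmp = tempfile.mkdtemp()
save_file('hello', os.path.join(tmp, '20210916'), 'test.txt')
with open(os.path.join(tmp, '20210916', 'test.txt'), encoding='utf-8') as f:
    print(f.read())  # → hello
```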
def download_rmrb(year, month, day, destdir):
    '''
    Function: crawl the site's news content for the given day and store it in the specified directory
    Parameters: year, month, day, root directory for the files
    '''
    pageList = getPageList(year, month, day)
    for page in pageList:
        titleList = getTitleList(year, month, day, page)
        for url in titleList:
            # fetch the news content
            html = fetchUrl(url)
            content = getContent(html)
            # build the save path and file name
            temp = url.split('=')[2].split('&')[0].split('-')
            pageNo = temp[0]
            titleNo = temp[0] if int(temp[0]) >= 10 else '0' + temp[0]
            path = destdir + '/' + year + month + day + '/'
            fileName = year + month + day + '-' + pageNo + '-' + titleNo + '.txt'
            # save the file
            saveFile(content, path, fileName)
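To see what the filename logic in download_rmrb actually extracts, the split chain can be run against a sample detail URL. The URL below is illustrative only; it follows the date/id/page query pattern the code assumes:

```python
# Illustrative detail URL with the assumed ?date=...&id=...&page=... shape.
url = 'https://www.shobserver.com/staticsg/res/html/journal/detail.html?date=2021-09-16&id=324072&page=03'

# split('=')[2] takes the text after 'id=' up to the next '=',
# split('&')[0] trims the trailing '&page', split('-') would break a
# hyphenated value apart (the id here has no hyphen).
temp = url.split('=')[2].split('&')[0].split('-')
print(temp)  # → ['324072']
```

Note that with this URL shape, temp[0] is the article id rather than the page number, which may be part of why the generated filenames look wrong.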
if __name__ == '__main__':
    '''
    Main: program entry point
    '''
    # crawl the news for the specified date
    newsDate = input('Please enter the date to crawl (format like 20210916): ')
    year = newsDate[0:4]
    month = newsDate[4:6]
    day = newsDate[6:8]
    download_rmrb(year, month, day, 'D:/CQRB')
    print('Crawl complete: ' + year + month + day)
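Instead of slicing the input string blindly, the date can be validated and split with datetime.strptime, which rejects malformed input early. A minimal sketch with a hard-coded date in place of input():

```python
import datetime

news_date = '20210916'  # same format the prompt asks the user for

# strptime raises ValueError for anything that is not a valid YYYYMMDD date,
# so typos like '2021916' are caught before any request is made.
d = datetime.datetime.strptime(news_date, '%Y%m%d')
year, month, day = d.strftime('%Y'), d.strftime('%m'), d.strftime('%d')
print(year, month, day)  # → 2021 09 16
```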