A Python program that crawls poems from several pages; because one page lacks the appreciation element, the program cannot continue

Time: 05-09

import os
import random
import re  # needed below for re.findall; missing from the original imports

import requests
from lxml import etree
from bs4 import BeautifulSoup

if not os.path.exists('./poetry'):
    os.mkdir('./poetry')
    print("The folder has been created!")

# Disguise the request as a browser
agent1 = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) " \
         "Chrome/31.0.1650.63 Safari/537.36"
agent2 = "HTC One Mozilla/5.0 (Linux; Android 4.0.3; HTC One X Build/IML74K) AppleWebKit/535.19 " \
         "(KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19"
agent3 = "Galaxy S4 Mozilla/5.0 (Linux; U; Android 4.2.2; GT-I9500 Build/JDQ39) " \
         "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
agent4 = "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) " \
         "Version/9.0 Mobile/13B143 Safari/601.1"
agent5 = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
list1 = [agent1, agent2, agent3, agent4, agent5]
agent = random.choice(list1)
headers = {
    "User-Agent": agent
}
# End of the browser disguise
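The block above picks one User-Agent string at random so consecutive requests don't all look identical. A condensed sketch of the same pattern (the two agent strings here are placeholders, not the ones from the script):

```python
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Linux; Android 10) ExampleMobile/2.0",
]

# Build the headers dict with one randomly chosen agent
headers = {"User-Agent": random.choice(user_agents)}
print(headers["User-Agent"])
# requests.get(url, headers=headers) would then send this header
```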
url = 'https://www.shicimingju.com'
page_text = requests.get(url, headers=headers).text  # headers must be passed by keyword; positionally it would be taken as params
# .encode('iso-8859-1').decode('utf-8')  # utf-8 matches the page's charset attribute
tree = etree.HTML(page_text)
list_li = tree.xpath('//*[@id="main_right"]/div[1]/ul/li[1]')  # li[1] with brackets: only the first li tag
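Two XPath details matter here: `li[1]` is 1-indexed and selects only the first `li`, and inside the loop the sub-queries must start with `./` rather than `/` (a leading `/` restarts the search from the document root and returns nothing). A sketch on an invented HTML snippet that mirrors the page structure:

```python
from lxml import etree

html = '''
<div id="main_right"><div><ul>
  <li><a href="/chaxun/zuozhe/1.html">李白</a></li>
  <li><a href="/chaxun/zuozhe/2.html">杜甫</a></li>
</ul></div></div>
'''
tree = etree.HTML(html)
# XPath indexes from 1, so li[1] is the first li only
first_li = tree.xpath('//*[@id="main_right"]/div[1]/ul/li[1]')
for li in first_li:
    href = li.xpath('./a/@href')[0]  # './' keeps the search relative to this li
    name = li.xpath('./a/text()')[0]
    print(href, name)  # /chaxun/zuozhe/1.html 李白
```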
for li in list_li:
    url1 = url + li.xpath('./a/@href')[0]  # each author's page address, taken from the home page
    # new_url = url + new_url
    zuozhe = li.xpath('./a/text()')[0]  # the author's name
    print(zuozhe)
    # print(url1)
    fp = open('./poetry.doc', 'w', encoding='utf-8')
    url2 = url1.replace('.html', '_{}.html')  # the {} placeholder yields the paged addresses of this author's poems
    # print(url2)
    pagetext1 = requests.get(url1, headers=headers).text
    soup = BeautifulSoup(pagetext1, 'lxml')
    number = soup.h1.text  # the h1 title text holds the poem count; the digits are extracted next
    total = int(re.findall(r'\d+', number)[0])  # regex extraction: turn the number in the string into an integer
    # print(total)
    page_number = int(total / 20 + 1)  # 20 poems per page
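Note that `int(total / 20 + 1)` overshoots by one whenever `total` is an exact multiple of 20 (the extra request just fetches an empty page, so the script still works). A sketch of exact ceiling division, assuming 20 poems per page:

```python
import math

def pages_needed(total, per_page=20):
    """Number of result pages for `total` items at `per_page` items each."""
    return math.ceil(total / per_page)

print(pages_needed(41))  # 3 pages: 20 + 20 + 1
print(pages_needed(40))  # 2 pages; int(40 / 20 + 1) would say 3
```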
    for i in range(1, page_number + 1):
        # print(url2.format(i))  # walk through every page of this poet's poems
        page_text2 = requests.get(url2.format(i), headers=headers).text.encode('iso-8859-1').decode('utf-8')
        soup = BeautifulSoup(page_text2, 'lxml')
        list_url3 = soup.select('h3')  # each poem sits in an h3 tag, so this lists every poem on the page
        for url3 in list_url3:
            url3 = url + url3.a['href']  # each poem's own URL; request it separately and crawl its data
            print(url3)
            page_text3 = requests.get(url3, headers=headers).text.encode('iso-8859-1').decode('utf-8')
            soup = BeautifulSoup(page_text3, 'lxml')
            # the poem's title
            title3 = soup.h1.text
            # the poem's text
            content = soup.find('div', class_='item_content').text
            # the appreciation notes; some pages have none, which caused the parse error
            shangxi_div = soup.find('div', class_='shangxi_content')
            if shangxi_div is None:  # the original tested len(shangxi) < 0, which can never be true
                continue
            shangxi = shangxi_div.text
            fp.write(title3.center(60) + zuozhe + ':' + '\n' + content + '\n' + shangxi)
            print(title3)
    fp.close()  # missing in the original; the file handle leaked
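The crash the title describes comes from calling `.text` on the `None` that BeautifulSoup's `find` returns when a page has no `shangxi_content` div; checking the length of the text afterwards is too late. A minimal sketch of the guard, on two invented HTML snippets (the div class names follow the script):

```python
from bs4 import BeautifulSoup

with_notes = '<div class="shangxi_content">notes here</div>'
without_notes = '<div class="item_content">poem only</div>'

def get_shangxi(html):
    """Return the appreciation text, or None when the page lacks that section."""
    div = BeautifulSoup(html, 'html.parser').find('div', class_='shangxi_content')
    return div.text if div is not None else None

print(get_shangxi(with_notes))     # notes here
print(get_shangxi(without_notes))  # None
```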

print('Done! Go and look at the downloaded file!')
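The `.encode('iso-8859-1').decode('utf-8')` calls in the script undo a wrong decoding guess: the page bytes are UTF-8, but when the response lacks a charset header, requests decodes the body as ISO-8859-1, producing mojibake. Because ISO-8859-1 maps every byte value, re-encoding with the wrong codec recovers the original bytes, which the right codec can then decode. A sketch with a made-up string:

```python
raw = '静夜思'.encode('utf-8')         # bytes as the server sends them
garbled = raw.decode('iso-8859-1')     # what a wrong charset guess yields
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)  # 静夜思
```

A tidier fix is to set `response.encoding = 'utf-8'` before reading `response.text`, so no round-trip is needed.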


