Is also based on ADSL dial-up, the difference is that need can take two sets of ADSL dial-up server, use both servers as the agent in the process of fetching,
Suppose there is A, B two ADSL dial-up server, crawlers in C running on the server, use A as A proxy to access the network, if in the process of fetching in case of blocking access agent will immediately switch to B, then A redial, if meet again is to switch to A proxy access is forbidden, B dialing again, and again, the figure below:
Dial the agent using A, B:
& amp; amp; lt; Img data - rawheight=& amp; Quot; 327 & amp; Quot; Data - rawwidth=& amp; Quot; 721 & amp; Quot; src=https://bbs.csdn.net/topics/"https://pic1.zhimg.com/50/9196e28cd8621a06cd0f0339f1fa765b_hd.jpg" class="origin_image useful - lightbox - thumb" width="721" data - the original="https://pic1.zhimg.com/9196e28cd8621a06cd0f0339f1fa765b_r.jpg" & amp; Gt;
Use the B as the agent, A dial:
& amp; lt; Img data - rawheight=& amp; quot; 327 & amp; quot; Data - rawwidth=& amp; quot; 721 & amp; quot; src=https://bbs.csdn.net/topics/"https://pic2.zhimg.com/50/7afaf540be23920733bc466ae3f6f651_hd.jpg" class="origin_image useful - lightbox - thumb" width="721" data - the original="https://pic2.zhimg.com/7afaf540be23920733bc466ae3f6f651_r.jpg" & amp; gt;
Code crawler (web) :
The import requests
Import the random
Pro=[' 122.152.196.126 ', '114.215.174.227', '119.185.30.75]
The head={
'the user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64 x64) AppleWebkit/537.36 (KHTML, like Gecko) chrome/58.0.3029.110 Safari/537.36 '
}
Url='http://www.whatismyip.com.tw/'
R=requests. Get (url, proxies={' HTTP ': the random choice (pro)}, headers=head)
R.e ncoding=of state Richard armitage pparent_encoding
Print (r.s tatus_code)
Print (r.t ext)
Other:
# coding=utf-8
The import requests
The import time
The from LXML import etree
Def getUrl () :
For I in range (33) :
Url='{} http://task.zbj.com/t-ppsj/p s5. HTML. The format (I + 1)
SpiderPage (url)
Def spiderPage (url) :
If the url is None:
Return None
Try:
Proxies={
'HTTP:' http://221.202.248.52:80 ',
}
User_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari 537.36/Core/1.53.4295.400 '
Headers={' the user-agent: user_agent}
HtmlText=requests. Get (url, headers=headers, proxies=proxies). The text
The selector=etree. HTML (htmlText)
TDS=selector. Xpath ('//*/@/table/tr ')
For td in TDS:
Price=td. Xpath ('/td/p/em/text () ')
A href=https://bbs.csdn.net/topics/td.xpath ('/td/p/a/@ href ')
Title=td. Xpath ('/td/p/a/text () ')
The subTitle=td. Xpath ('/td/p/text () ')
Deadline=td. Xpath ('/td/span/text () ')
Price=price [0] if len (price) & gt; 0 the else '
Title=title [0] if len (title) & gt; 0 the else '
Href=https://bbs.csdn.net/topics/href [0] if len (href)> 0 else '
The subTitle=the subTitle [0] if len (subTitle) & gt; 0 the else '
Deadline, deadline of=[0] if len (deadline) & gt; 0 the else '
The print price, the title, the href, subTitle, deadline
Print '-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --'
SpiderDetail (href)
Except the Exception, e:
Print 'wrong', e.m essage
Def spiderDetail (url) :
If the url is None:
Return None
Try:
HtmlText=requests. Get (url). The text
The selector=etree. HTML (htmlText)
AboutHref=https://bbs.csdn.net/topics/selector.xpath ('//* [@ id="utopia_widget_10"]/div [1]/div/div/div/p [1]/a/@ href ')
Price=selector. Xpath ('//* [@ id="utopia_widget_10"]/div [1]/div/div/div/p [1]/text () ')
Title=the selector. Xpath ('//* [@ id="utopia_widget_10"]/div [1]/div/div/h2/text () ')
ContentDetail=selector. Xpath ('//* [@ id="utopia_widget_10"]/div [2]/div/div [1]/div [1]/text () ')
PublishDate=selector. Xpath ('//* [@ id="utopia_widget_10"]/div [2]/div/div [1]/p/text () ')
AboutHref=https://bbs.csdn.net/topics/aboutHref [0] if len (aboutHref)> 0 else '# python ternary operations: to true if the result of the decision condition when else is false, the results of the
Price=price [0] if len (price) & gt; 0 the else '
Title=title [0] if len (title) & gt; 0 the else '
ContentDetail=contentDetail [0] if len (contentDetail) & gt; 0 the else '
PublishDate=publishDate [0] if len (publishDate) & gt; 0 the else '
Print aboutHref, price, the title, contentDetail, publishDate
Except:
Print 'error'
If '_main_ :
GetUrl ()
CodePudding user response:
This requirement is certainly looking for HTTP proxy, BAN cut agent immediately, your method premise is: dial-up will get different IP, what's more, if the operator level is LAN, no matter how to change, there is an IP server,CodePudding user response:
666666