Python crawler: why does the program still raise an error and run the except branch even though the data is crawled normally?

Time:05-22

import requests
from lxml import etree
import csv
from datetime import datetime
import time

def doSth():
    try:
        # 1. Target URL
        url = 'https://s.weibo.com/top/summary?cate=realtimehot'
        # Request headers that imitate a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

        # 2. Send the request
        data = requests.get(url, headers=headers).text
        # Parse the response into an element tree
        html = etree.HTML(data)

        # 3. Extract the data; each xpath() call returns a list (the class values inside [@class="..."] are missing from the post)
        # Rank
        rank = html.xpath('//td[@class="..."]/text()')
        # Events
        affair = html.xpath('//td[@class="..."]/a/text()')
        affair.pop(0)  # skip the pinned recommendation at the top of the Weibo hot search; .pop(n) removes the (n+1)-th element
        # Heat
        view = html.xpath('//td[@class="..."]/span/text()')

        # Links
        link = html.xpath('//tr/td/a/@href')
        link_try = html.xpath('//tr/td/a/@href_to')
        link.pop(0)
        # Fix up the links (the real link may sit in a different attribute, hence the check below)
        index = 0
        for i, sku in enumerate(link):  # What are i and sku here? Is this i the same i used in the save loop at the end?
            if sku == 'javascript:void(0);':
                link[i] = link_try[index]
                index += 1

        # 4. Save the data as CSV
        date = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        with open('./' + date + '.csv', 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['no', 'events', 'heat', 'links'])
            for i, rank in enumerate(rank):
                writer.writerow([rank, affair[i], view[i], 'https://s.weibo.com' + link[i]])
        # 5. Sleep for 120 seconds
        time.sleep(120)
    except:
        print(time.strftime('%Y-%m-%d %X'))
        print('requests speed so high, need sleep!')
        time.sleep(10)
        print('continue...')

while True:
    doSth()



Why does the program still hit an error and run the except block even though the data is crawled normally?
There are also two errors.


It clearly worked fine a few days ago, so why not now?
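
Because the bare except prints the same "requests speed so high" message for every kind of failure, the real error stays hidden. A minimal sketch of one way to surface it, using the standard traceback module, is below. The names run_once and broken_task are hypothetical stand-ins, not part of the posted script, and the sample lists are made up.

```python
import traceback


def run_once(task):
    """Run task(); on failure print the real exception instead of a generic message."""
    try:
        task()
        return True
    except Exception:
        # format_exc() returns the full traceback of the exception being handled
        print(traceback.format_exc())
        return False


def broken_task():
    # Hypothetical stand-in for the scraper body: three ranks but only two heat values
    rank = ['1', '2', '3']
    view = ['100', '90']
    return [(r, view[i]) for i, r in enumerate(rank)]


ok = run_once(broken_task)
```

With this kind of handler the console would show the exception type and the failing line, which makes it easier to tell a rate limit apart from a parsing problem.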

CodePudding user response:

The matching rules no longer fit the page. If you analyze the page, one or two entries are hidden: you cannot see them on the surface, but the current rules still match them. Those hidden entries have no heat value, so the results that follow go wrong. Of the four lists (rank, events, heat, links), the heat list is missing some items, the list lengths are inconsistent, and indexing raises an error. Solution: after matching, delete the hidden entries from the other lists, or give them a heat value of 0.
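
The effect described above can be reproduced without the network. The sketch below uses made-up sample lists (not real Weibo data) and a hypothetical helper write_rows that indexes every list by i, as the posted loop does. It suggests why some rows are saved before the except branch runs.

```python
import csv
import io


def write_rows(out, rank, affair, view, link):
    """Write rows the way the posted loop does, indexing every list by i."""
    writer = csv.writer(out)
    writer.writerow(['no', 'events', 'heat', 'links'])
    for i, r in enumerate(rank):
        writer.writerow([r, affair[i], view[i], link[i]])


# Made-up sample data: the second entry is a hidden item with no heat value
rank = ['1', '-', '2']
affair = ['topic A', 'hidden topic', 'topic B']
view = ['500', '300']  # one element short
link = ['/a', '/p', '/b']

buffer = io.StringIO()
error = None
try:
    write_rows(buffer, rank, affair, view, link)
except IndexError as exc:
    error = exc

# Lines that reached the output before the loop failed
written = buffer.getvalue().splitlines()
```

In this sample the header and the first two rows are written before view[i] runs out, so the file looks like a normal crawl. The heat value on the hidden row also belongs to the entry after it, so the saved data is shifted as well as truncated.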

CodePudding user response:

The improved code (lines 45-51, the block that removes the hidden entries, are new):

 
import requests
from lxml import etree
import csv
from datetime import datetime
import time
import copy


def doSth():
    try:
        # 1. Target URL
        url = 'https://s.weibo.com/top/summary?cate=realtimehot'
        # Request headers that imitate a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}

        # 2. Send the request
        data = requests.get(url, headers=headers).text
        # Parse the response into an element tree
        html = etree.HTML(data)

        # 3. Extract the data; each xpath() call returns a list (the class values inside [@class="..."] are missing from the post)
        # Rank
        rank = html.xpath('//td[@class="..."]/text()')
        # Events
        affair = html.xpath('//td[@class="..."]/a/text()')
        affair.pop(0)  # skip the pinned recommendation at the top of the Weibo hot search; .pop(n) removes the (n+1)-th element
        # Heat
        view = html.xpath('//td[@class="..."]/span/text()')

        # Links
        link = html.xpath('//td/a/@href')
        link_try = html.xpath('//td/a/@href_to')
        link.pop(0)
        # Fix up the links (the real link may sit in a different attribute, hence the check below)
        index = 0
        for i, sku in enumerate(link):  # What are i and sku here? Is this i the same i used in the save loop at the end?
            if sku == 'javascript:void(0);':
                link[i] = link_try[index]
                index += 1

        # 4. Save the data as CSV
        date = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

        # Remove the hidden hot searches that have no heat value
        rank_new = copy.deepcopy(rank)  # snapshot to test against while the lists are edited
        for r in reversed(range(len(rank_new))):  # go backwards so a deletion does not shift the indexes still to be checked
            if not rank_new[r].isdigit():
                del rank[r]
                del affair[r]
                del link[r]

        with open('./' + date + '.csv', 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['no', 'events', 'heat', 'links'])
            for i, rank in enumerate(rank):
                writer.writerow([rank, affair[i], view[i], 'https://s.weibo.com' + link[i]])
        # 5. Sleep for 120 seconds
        time.sleep(120)
    except:
        print(time.strftime('%Y-%m-%d %X'))
        print('requests speed so high, need sleep!')
        time.sleep(10)
        print('continue...')


while True:
    doSth()
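
The other option mentioned in the answer, giving hidden entries a heat value of 0, could be sketched by reading each table row as a unit so the four fields cannot drift apart. This is only an illustration under assumptions: the markup in SAMPLE, the class names rank and topic, and the helper parse_rows are all hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's tags and class names may differ
SAMPLE = """
<table>
  <tr><td class="rank">1</td><td class="topic"><a href="/a">topic A</a><span>500</span></td></tr>
  <tr><td class="rank">-</td><td class="topic"><a href="/p">hidden topic</a></td></tr>
  <tr><td class="rank">2</td><td class="topic"><a href="/b">topic B</a><span>300</span></td></tr>
</table>
"""


def parse_rows(markup):
    """Return one (rank, event, heat, link) tuple per table row."""
    rows = []
    for tr in ET.fromstring(markup.strip()).iter('tr'):
        rank = tr.findtext("td[@class='rank']", default='').strip()
        anchor = tr.find("td[@class='topic']/a")
        if anchor is None:
            continue  # not a data row
        # A missing <span> means the row has no heat value; fall back to '0'
        heat = tr.findtext("td[@class='topic']/span", default='0').strip()
        rows.append((rank, anchor.text.strip(), heat, anchor.get('href')))
    return rows


rows = parse_rows(SAMPLE)
```

Since every tuple comes from a single row, a row without a heat value simply gets '0' and the rows after it keep their own values. The same per-row idea should carry over to lxml by looping over the tr elements and calling xpath() on each row with a relative expression.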


If my answer was helpful, please close the thread and award points!