A question about crawler code


I want to use a crawler to get the name, unit price and sales count of every item in the shop me-too1980.taobao.com, so I wrote the code below, but I don't know why it never grabs anything. If I test the regular expressions on their own against the page's HTML they extract the information fine, but run directly against the fetched HTML they match nothing. Could someone take a look at the reason? Thank you very much.
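One way to narrow this down is to check what the script actually receives before worrying about the regexes: Taobao's asynSearch.htm endpoint usually needs valid login cookies (otherwise it tends to return an anti-spider/login page), and when it does return data it is a JSONP payload in which the embedded HTML has its quotes backslash-escaped, so patterns written for plain quotes may not match. Below is only a minimal diagnostic sketch, assuming the same URL and a user-agent header as in the code further down; the dump file name is arbitrary.

import requests

# Fetch one page with the same parameters as the script below and dump the
# raw response, so it can be searched for 'c-price' / 'sale-num' by hand.
url = ("https://me-too1980.taobao.com/i/asynSearch.htm"
       "?&callback=jsonp137&mid=w-22507069265-0&wid=22507069265&pageNo=1")
headers = {'user-agent': 'Mozilla/5.0'}

r = requests.get(url, headers=headers)
r.encoding = r.apparent_encoding
with open("page_dump.txt", "w", encoding="utf-8") as f:
    f.write(r.text)

print(r.status_code, len(r.text))
print("contains 'c-price':", 'c-price' in r.text)
print("contains 'sale-num':", 'sale-num' in r.text)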

# coding=GBK
import requests
import re


# Fetch a page and return its text, or "" if the request fails
def getHtmlText(url):
    try:
        head_new = {
            'authority': 'me-too1980.taobao.com',
            'method': 'GET',
            # ... (remaining header fields omitted in the original post)
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47',
            'x-requested-with': 'XMLHttpRequest'
        }
        r = requests.get(url, headers=head_new)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("crawl failure")
        return ""


# Pull product name, price and sales count out of the page with regexes
def parsePage(ilist, html):
    try:
        goods_name = re.findall(r'<img alt=".*?"', html)
        goods_price = re.findall(r'c-price">\d+\.\d*<', html)
        goods_sale_count = re.findall(r'sale-num">\d+<', html)
        for i in range(len(goods_name)):
            price = eval(re.split(r'[>|<]', goods_price[i])[1])
            sale_count = eval(re.split(r'[>|<]', goods_sale_count[i])[1])
            name = goods_name[i].split('"')[1]
            ilist.append([name, price, sale_count])
    except:
        print("parse error")


# Print the collected items as a simple table
def printGoodsList(ilist):
    print("=====================================================================================================")
    tplt = "{0:<3}\t{1:<70}\t{2:<6}\t{3:<6}"
    print(tplt.format("serial number", "product name", "price", "sales"))
    count = 0
    for g in ilist:
        count += 1
        print(tplt.format(count, g[0], g[1], g[2]))
    print("=====================================================================================================")


def main():
    depth = 2
    start_url = "https://me-too1980.taobao.com/i/asynSearch.htm?&callback=jsonp137&mid=w-22507069265-0&wid=22507069265"
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&pageNo=' + str(1 + i)
            html = getHtmlText(url)
            parsePage(infoList, html)
        except:
            continue

    printGoodsList(infoList)


main()
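For reference: if the dump suggested above shows that the product markup is present but the quotes arrive backslash-escaped (for example class=\"c-price\" inside the JSONP string), the patterns can be loosened to accept an optional backslash before the quote. This is only a sketch under that assumption, not a confirmed fix; whether it applies depends on what the response actually contains.

import re

# Hypothetical variants of the price/sales patterns that match both
#   c-price">128.00<   and the escaped form   c-price\">128.00<
price_pat = re.compile(r'c-price\\?">(\d+\.?\d*)<')
sale_pat = re.compile(r'sale-num\\?">(\d+)<')

sample = '<strong class=\\"c-price\\">128.00<\\/strong> ... <em class=\\"sale-num\\">35<\\/em>'
print(price_pat.findall(sample))   # ['128.00']
print(sale_pat.findall(sample))    # ['35']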