New to crawlers: etree.HTML parses only part of a Tieba page's content

Time:09-23

I have just started learning web scraping and wanted to write a crawler that downloads the pictures from a specified Tieba (Baidu Post Bar) forum.
I fetch the page content with requests.get().

However, etree.HTML parses out only a small part of the page, so I cannot use XPath to get the content I want.

After researching for a long time, I suspect the cause is that the page Tieba returns contains two <html> tags, so it cannot be parsed like a normal HTML page into an object that XPath can query.
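One way to test this suspicion is to count the opening <html> tags in the raw response text before handing it to a parser. The snippet below is only a minimal sketch: the helper name count_html_open_tags and the sample string are made up for illustration, and it uses the standard library only.

```python
import re


def count_html_open_tags(page_text: str) -> int:
    """Count opening <html ...> tags in raw page text (case-insensitive).

    Hypothetical helper for illustration; closing </html> tags are not counted.
    """
    return len(re.findall(r"<html[\s>]", page_text, flags=re.IGNORECASE))


# Made-up sample with two root elements, standing in for a real response body.
sample = (
    "<html><body><p>first</p></body></html>"
    "<html><body><p>second</p></body></html>"
)
print(count_html_open_tags(sample))
```

With a real page, the same function could be called on response.text. A count greater than one would support the two-tag explanation; a count of one would suggest the cause lies elsewhere.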

How should this situation be handled?


 
import requests
from lxml import etree


url = "https://tieba.baidu.com/f?kw=lol&fr=ala0&tpl=5&pn=0&"
# Send a browser-like User-Agent header with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
}
response = requests.get(url, headers=headers)
content = response.content  # raw response body as bytes
html = etree.HTML(content)  # parse the bytes into an element tree
# Serialize the parsed tree to see what the parser kept
print(etree.tostring(html).decode())

Running this code shows that only part of the page is printed.

CodePudding user response:

No problem. Take a different approach: use a regular expression to extract the URLs.
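A rough sketch of that idea follows. A regular expression works on the raw text, so it does not matter whether the page parses into a usable tree. The pattern, the helper name extract_image_urls, and the example.com sample markup are all assumptions made for illustration; the attribute layout of a real Tieba page may differ and the pattern would need adjusting.

```python
import re

# Assumed pattern: the src attribute of an <img> tag pointing at an
# absolute http(s) URL that ends in a common image extension.
IMG_SRC_PATTERN = re.compile(
    r'<img[^>]+src="(https?://[^"]+\.(?:jpg|jpeg|png|gif))"',
    re.IGNORECASE,
)


def extract_image_urls(page_text):
    """Return image URLs found in raw page text, in order, without duplicates."""
    seen = set()
    urls = []
    for url in IMG_SRC_PATTERN.findall(page_text):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample markup; the first <img> sits inside an HTML comment,
# which the regex still sees because it scans plain text.
sample = '''
<!-- <li><img class="thumb" src="https://example.com/a.jpg"></li> -->
<img src="https://example.com/b.png" alt="">
<img src="https://example.com/a.jpg">
<img src="/static/logo.svg">
'''
print(extract_image_urls(sample))
```

With the code from the question, the input would be response.text instead of the sample string. Regular expressions are brittle against markup changes, so this is best treated as a workaround rather than a general replacement for XPath.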