Scraping Toutiao (Today's Headlines) with Python requests: why is the page content not returned?

Time:11-18

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
}
response = requests.get('http://toutiao.com/group/6552087122092753412', headers=headers)
print(response.status_code)
print(response.text)
----------------------------------------
Results:
----------------------------------------
E:\Python_Pro\spiders\venv\Scripts\python.exe E:/Python_Pro/toutiao/jiepai.py
200
<body>

Process finished with exit code 0
----------------------------------------
The status code is 200, so the request should have succeeded, but why is there no content, only <html><body>? Many people online say they copied my code directly, ran it, and got the content. Is it because I visit the site so often that its anti-crawler mechanism has blocked me? Hoping an expert can answer.

CodePudding user response:

The header name is User-Agent, with no spaces in it.
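
To illustrate the point about the header name, here is a minimal sketch that flags header names containing spaces or other characters that are not allowed in an HTTP header name. The function name `bad_header_names` and the sample dictionaries are made up for illustration and are not from the original post. If the name is mistyped, the server does not see it as a User-Agent header, and requests still sends its own default `python-requests/...` User-Agent.

```python
import re

# Characters allowed in an HTTP header name (the "token" rule in the HTTP spec).
_TOKEN = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def bad_header_names(headers):
    """Return the header names that are not valid HTTP tokens (e.g. contain spaces)."""
    return [name for name in headers if not _TOKEN.match(name)]


# Hypothetical examples: a mistyped name with spaces, and the correct name.
headers_wrong = {"User - Agent": "Mozilla/5.0"}
headers_right = {"User-Agent": "Mozilla/5.0"}

print(bad_header_names(headers_wrong))  # ['User - Agent']
print(bad_header_names(headers_right))  # []
```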

CodePudding user response:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4882.400 QQBrowser/9.7.13059.400'
}
response = requests.get('http://toutiao.com/group/6552087122092753412', headers=headers)
print(response.status_code)
print(response.text)

CodePudding user response:

First remove the space in User-Agent and try again. If that does not help, it is probably the anti-crawler mechanism.

CodePudding user response:

It really is the User-Agent that was written wrong.
I analyzed the page for you. By the way, if you want to scrape the images, they are in:
gallery: JSON.parse("{\"count\":9,\"sub_images\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/pgc-image\\/1525526621497af92b9b23b\",\"width\":690,... A regular expression is enough to match this string.
If you also want the comments, the interface is
https://www.toutiao.com/api/comment/list/?group_id=6552087122092753412&item_id=6552087122092753412&offset=0&count=5 and it is a GET request that returns data in JSON format.
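
The two suggestions above can be sketched roughly as follows. This is only an illustration: `sample_html` is a shortened, made-up stand-in for the page source, the regular expression assumes the `gallery: JSON.parse("...")` form quoted above, and the helper names `extract_gallery` and `comment_url` are hypothetical. The site may have changed since this reply was written, so both the pattern and the comment interface may no longer match.

```python
import json
import re
from urllib.parse import urlencode

# Shortened, hypothetical sample of the page source around the gallery data.
sample_html = r'''gallery: JSON.parse("{\"count\":1,\"sub_images\":[{\"url\":\"http:\\/\\/p1.pstatp.com\\/origin\\/pgc-image\\/1525526621497af92b9b23b\",\"width\":690}]}"),'''


def extract_gallery(html):
    """Find the quoted argument of gallery: JSON.parse(...) and decode it to a dict."""
    match = re.search(r'gallery: JSON\.parse\((".*?")\),', html, re.S)
    if match is None:
        return None
    # First decode: the quoted, escaped string literal becomes plain JSON text.
    # (Using json.loads for this step is an approximation of JavaScript string rules.)
    json_text = json.loads(match.group(1))
    # Second decode: the JSON text becomes a Python dict.
    return json.loads(json_text)


def comment_url(group_id, offset=0, count=5):
    """Build the comment-list URL quoted in the reply above."""
    params = {
        "group_id": group_id,
        "item_id": group_id,
        "offset": offset,
        "count": count,
    }
    return "https://www.toutiao.com/api/comment/list/?" + urlencode(params)


gallery = extract_gallery(sample_html)
print(gallery["sub_images"][0]["url"])
print(comment_url("6552087122092753412"))
```

With real page source, pass `response.text` to `extract_gallery` and check for `None` before using the result.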

CodePudding user response:

Append robots.txt to the site address and see which user agents it allows access to.
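
A small sketch of that check using the standard library's `urllib.robotparser`. The robots.txt content below is invented so that the example runs offline; it is not the real file of any site, and `my-crawler` is a hypothetical agent name. For a live site you could instead call `set_url(...)` with the site's robots.txt address and then `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
sample_robots = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

page = "http://toutiao.com/group/6552087122092753412"

# Ask whether a given user agent may fetch the page under these rules.
allowed_for_spider = parser.can_fetch("Baiduspider", page)
allowed_for_other = parser.can_fetch("my-crawler", page)

print(allowed_for_spider)
print(allowed_for_other)
```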

CodePudding user response:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get(url, verify=False)

r.html.render(sleep=1)  # add a wait time after rendering

CodePudding user response:

I have this problem too. Even with a User-Agent set, the request sometimes succeeds and sometimes fails.
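
For the intermittent case, one possible workaround is to detect the empty response and retry after a pause. The sketch below is an assumption rather than a confirmed fix: `looks_empty` is a crude heuristic based on the `<html><body>` output shown in the question, and `fetch_with_retry` takes any function that returns the page text, so both names are hypothetical.

```python
import time


def looks_empty(html):
    """Heuristic (an assumption): a body with nothing but html/body tags counts as empty."""
    for tag in ("<html>", "</html>", "<body>", "</body>"):
        html = html.replace(tag, "")
    return html.strip() == ""


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times and return the first non-empty page, else None."""
    for attempt in range(attempts):
        html = fetch(url)
        if not looks_empty(html):
            return html
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before trying again
    return None
```

With requests, `fetch` could be something like `lambda u: requests.get(u, headers=headers).text`.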