I am trying to crawl the page https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW but I get a 410 error:
INFO: Ignoring response <410 https://www.rightmove.co.uk/properties/105717104>: HTTP status code is not handled or not allowed
I am just trying to find the properties that have been sold using the notification on the page "This property has been removed by the agent."
I know the website has not blocked me, because I can fetch the data in the Scrapy shell and view(response) works fine too. I can also open the same URL directly in a web browser, so the 410 doesn't make sense to me. I can also crawl other pages from the same domain, i.e. the pages without the notification "This property has been removed by the agent."
Any help would be much appreciated.
CodePudding user response:
It seems that when a listing has been marked as removed by an agent, Rightmove returns the status code 410 Gone (which is quite weird). To solve this, simply do something like this in your request:
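    def start_requests(self):
        # No callback is given, so Scrapy sends the response to self.parse.
        yield scrapy.Request(
            url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
            meta={
                # Let 410 responses through to the callback instead of dropping them.
                'handle_httpstatus_list': [410],
            },
        )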
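Once the 410 response reaches your callback, you still need to tell removed listings apart from live ones. Here is a minimal sketch of that check in plain Python. The helper name classify_listing and its return values are hypothetical, and it assumes the removal notice appears verbatim in the page body:

```python
# Notification text quoted in the question.
REMOVED_NOTICE = "This property has been removed by the agent."


def classify_listing(status, body_text):
    """Hypothetical helper: label a fetched listing page.

    Returns 'removed' if the removal notice is in the body,
    'active' for any other 2XX response, and 'unknown' otherwise.
    """
    if REMOVED_NOTICE in body_text:
        return "removed"
    if 200 <= status < 300:
        return "active"
    return "unknown"
```

In the spider's callback you could call it as classify_listing(response.status, response.text) and yield only the listings labelled 'removed'.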
EDIT
Explanation: By default, Scrapy only passes a response on to your spider if its status code is in the range 200-299, since 2XX means the request was successful. In your case you got a 4XX status code, which means that some error happened, so the response was ignored. By passing handle_httpstatus_list = [410] in the request's meta, we tell Scrapy that we want it to also handle 410 responses and not only 200-299.
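To make that rule concrete, here is a simplified model of the decision in plain Python. It is only a sketch of the behaviour described above, not Scrapy's actual code, and the function name is_passed_to_spider is made up for illustration:

```python
def is_passed_to_spider(status, handle_httpstatus_list=()):
    """Simplified model: does a response with this status reach the callback?"""
    # 2XX responses always get through.
    if 200 <= status < 300:
        return True
    # Anything else gets through only if it was explicitly allowed.
    return status in handle_httpstatus_list


print(is_passed_to_spider(200))          # True
print(is_passed_to_spider(410))          # False
print(is_passed_to_spider(410, [410]))   # True
```

The second call is the situation in the question: the 410 is not in any allowed list, so the response is ignored and only the INFO log line shows up.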
Here are the docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std-reqmeta-handle_httpstatus_list