Unable to crawl a website using Scrapy, but the same website can be requested using the Scrapy shell

Time:10-20

I am trying to crawl the page https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW but I get a 410 error:

INFO: Ignoring response <410 https://www.rightmove.co.uk/properties/105717104>: HTTP status code is not handled or not allowed

I am just trying to find the properties that have been sold, by looking for this notification on the page: "This property has been removed by the agent."

I know the website has not blocked me, because I am able to use the Scrapy shell to get the data, and view(response) works fine too. I can also go directly to the same URL in a web browser, so the 410 doesn't make sense. I can crawl other pages from the same domain as well, i.e. the pages without the notification "This property has been removed by the agent."

Any help would be much appreciated.

CodePudding user response:

It seems that when a listing has been marked as removed by an agent on Rightmove, the website returns status code 410 Gone (which is quite weird). To solve this, simply do something like this in your request:

def start_requests(self):
    yield scrapy.Request(
        url='https://www.rightmove.co.uk/properties/105717104#/?channel=RES_NEW',
        meta={
            # Let 410 responses through to the callback instead of dropping them
            'handle_httpstatus_list': [410],
        },
    )

EDIT

Explanation: By default, Scrapy only handles a response if its status code is in the range 200-299, since 2XX means that it was a successful response. In your case, you got a 4XX status code, which means that some error happened, so the response was ignored. By passing handle_httpstatus_list = [410] in the request meta, we tell Scrapy that we want it to also handle 410 responses and not only 200-299.

Here are the docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#std-reqmeta-handle_httpstatus_list
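
To make the rule above concrete, here is a small plain-Python sketch of the decision. It is only a simplified model of the behaviour described in the explanation, not Scrapy's actual source code. The function names is_passed_to_callback and is_removed_listing are made up for this illustration, and it assumes the notification text quoted in the question appears in the body of the 410 response.

```python
# Simplified model of the status filtering described above.
# This is NOT Scrapy's implementation; the function names are hypothetical.

# Notification text quoted in the question (assumed to be present in the page body).
REMOVED_NOTICE = "This property has been removed by the agent."


def is_passed_to_callback(status, handle_httpstatus_list=()):
    """Return True if a response with this status would reach the spider callback."""
    # 2XX responses are always treated as successful.
    if 200 <= status < 300:
        return True
    # Any other status is only let through if it was explicitly allowed.
    return status in handle_httpstatus_list


def is_removed_listing(status, body_text):
    """Return True if the response looks like a listing removed by the agent."""
    return status == 410 and REMOVED_NOTICE in body_text


if __name__ == "__main__":
    print(is_passed_to_callback(200))          # successful response
    print(is_passed_to_callback(410))          # ignored by default
    print(is_passed_to_callback(410, [410]))   # allowed explicitly
```

In a real spider the second check would live in the callback, using response.status and response.text, once the 410 response has been allowed through.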
