How to check response status for http error codes using Scrapy?


I want to check the response status and export it to a CSV file using Scrapy. I tried response.status, but it only ever shows 200 in the exported CSV. How can I also capture other status codes such as 404, 502, etc.?

def parse(self, response):
    yield {
        'URL': response.url,
        'Status': response.status
    }

CodePudding user response:

In your settings you can adjust the following options so that responses with non-200 status codes are not automatically filtered out by Scrapy.

HTTPERROR_ALLOWED_CODES

Default: []

Pass all responses with non-200 status codes contained in this list.

HTTPERROR_ALLOW_ALL

Default: False

Pass all responses, regardless of their status code.

settings.py

# Either pass every response through, regardless of status code...
HTTPERROR_ALLOW_ALL = True

# ...or list only the non-200 codes you want to handle (add others as needed):
HTTPERROR_ALLOWED_CODES = [500, 501, 404]
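
As a minimal sketch (the spider name and URLs here are placeholders, not from the question), the same thing can be done per spider via the custom_settings attribute instead of editing settings.py, so the question's parse() also receives non-200 responses:

import scrapy


class StatusSpider(scrapy.Spider):
    name = 'status'
    # per-spider equivalent of the settings.py entries above
    custom_settings = {
        'HTTPERROR_ALLOW_ALL': True,
    }

    def start_requests(self):
        # placeholder URLs; one of them is expected to return a non-200 code
        for url in ['https://example.com/', 'https://example.com/missing']:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # with HTTPERROR_ALLOW_ALL, 404/502/... responses also reach parse()
        yield {
            'URL': response.url,
            'Status': response.status,
        }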

CodePudding user response:

You can add an errback to the request, catch the HTTP error in the errback function, and yield the required information from there. The Scrapy docs have more information about the errback argument. See the sample below.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['example.com']

    def start_requests(self):
        # successful responses go to self.parse, failures to self.parse_error
        yield scrapy.Request(
            url="https://example.com/error",
            callback=self.parse,
            errback=self.parse_error,
        )

    def parse_error(self, failure):
        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware
            # and carry the non-200 response
            response = failure.value.response
            yield {
                'URL': response.url,
                'Status': response.status
            }
        # other failures (e.g. DNS errors, timeouts) carry no response object

    def parse(self, response):
        yield {
            'URL': response.url,
            'Status': response.status
        }
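
With either approach, running the spider with a feed export, e.g. scrapy crawl test -o output.csv, writes the yielded URL/Status items, including the non-200 ones, to the CSV file.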