Return non-zero exit code when raising a scrapy.exceptions.UsageError exception


I have a Scrapy script which looks like this:

main.py

import os
import argparse
import datetime
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.mySpider import MySpider

parser = argparse.ArgumentParser(description='My Scrapper')
parser.add_argument('-v',
                    '--verbose', 
                    help='Verbose mode',
                    action='store_true')
parser.add_argument('-t', 
                    '--type', 
                    help='Type',
                    type=str)

args = parser.parse_args()

if args.type != 'expected':
    parser.error("Wrong type")

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider, type_arg=args.type)
    process.start()

mySpider.py

from scrapy import Spider
from scrapy.http import Request, FormRequest
import scrapy.exceptions as ScrapyExceptions

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        # ...
        # Some logic
        # ...

        if condition:
            raise ScrapyExceptions.UsageError("Wrong argument")

When I call parser.error() in main.py, the process returns a non-zero exit code as expected. However, when I raise scrapy.exceptions.UsageError() in mySpider.py, the process exits with code 0, so the Jenkins pipeline step that runs my script thinks it succeeded and continues with the pipeline. I run the script with the command python3 main.py --type my_type.

Why doesn't the usage error raised in the mySpider.py module make the script exit with a non-zero code?

CodePudding user response:

After several hours of trying different approaches and reading this issue, the problem turned out to be that Scrapy does not set a non-zero exit code when a scrape fails: exceptions raised inside a spider callback are caught and logged by the engine, so they never propagate out of process.start(), and the interpreter exits normally with code 0. I managed to fix this behaviour by using the crawler stats collection.

main.py

import sys  # required for the sys.exit() call below

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider, type_arg=args.type)
    # Keep a reference to the crawler before starting the blocking reactor;
    # process.crawlers has been populated by the crawl() call above.
    crawler = list(process.crawlers)[0]
    process.start()

    # The spider sets this custom stat when it hits the error condition.
    failed = crawler.stats.get_value('custom/failed_job')
    if failed:
        sys.exit(1)
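Note that process.crawlers is only filled in once crawl() has been called, and process.start() blocks until the crawl finishes, which is why the crawler reference has to be taken between those two calls.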

mySpider.py

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        # ...
        # Some logic
        # ...

        if condition:
            # Record the failure in the crawler stats so that main.py
            # can detect it once the crawl has finished.
            self.crawler.stats.set_value('custom/failed_job', 'True')
            raise ScrapyExceptions.UsageError("Wrong argument")
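
A variation on the same idea, sketched below rather than taken from the answer above: instead of a custom stats key, you can raise scrapy.exceptions.CloseSpider from the callback and check Scrapy's built-in finish_reason stat, which is set to 'finished' only after a clean shutdown. The condition flag here is a stand-in for the question's validation logic.

import sys

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider
from scrapy.utils.project import get_project_settings


class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        condition = True  # stand-in for the real validation logic
        if condition:
            # CloseSpider stops the spider; its reason is recorded in the
            # built-in 'finish_reason' stat instead of a custom key.
            raise CloseSpider(reason='wrong_argument')


if __name__ == "__main__":
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl(MySpider)
    crawler = list(process.crawlers)[0]
    process.start()

    # Scrapy records 'finished' on a clean shutdown; anything else
    # (here: 'wrong_argument') is treated as a failure.
    if crawler.stats.get_value('finish_reason') != 'finished':
        sys.exit(1)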