Scrapy: custom loggers appear to be overriding Scrapy loggers and silencing Scrapy logger output


I am trying to integrate my own loggers with my Scrapy project. The desired outcome is to log output from both my custom loggers and the Scrapy loggers to stderr at the desired log level. I have observed the following:

  • Any module/class that uses its own logger seems to override the Scrapy logger, as Scrapy logging from within the related module/class appears to be completely silenced.
    • The above is confirmed whenever I disable all references to my custom logger. For example, if I do not instantiate my custom logger in forum.py, the Scrapy packages resume sending logging output to stderr.
  • I've tried this with both install_root_handler=True and install_root_handler=False, and I don't see any difference in the logging output.
  • I have confirmed that my loggers are being properly fetched from my logging config, as the returned logger object has the correct attributes.
  • I have confirmed that my Scrapy settings are successfully passed to CrawlerProcess.

My project structure:

.
└── scraper
    ├── logging.cfg
    ├── main.py
    ├── scraper
    │   ├── __init__.py
    │   ├── forums.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── runner.py
    │   ├── settings.py
    │   ├── spiders
    │   │   ├── __init__.py
    │   │   ├── forum.py
    │   │   └── someforum.py
    │   └── utils.py
    └── scrapy.cfg

The program is designed to be called from main.py.

Contents of main.py:

import os

from scraper import runner, utils
from scrapy.utils.project import get_project_settings

logger = utils.get_logger("launcher")


def main():
    """Entrypoint to the web scraper."""

    # Guarantee that we have the correct reference to the settings file
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")

    logger.info("Initializing spiders")
    runner.run_spiders(get_project_settings())


if __name__ == "__main__":
    main()

Contents of runner.py:

from scraper import forums, utils
from scraper.spiders.someforum import SomeForum
from scrapy.crawler import CrawlerProcess

logger = utils.get_logger("launcher")


def run_spiders(project_settings):

    process = CrawlerProcess(project_settings, install_root_handler=False)
    logger.info(
        f"Initialzing spider for {forums.someforum.get('forum_attrs').get('name')}"
    )
    process.crawl(
        SomeForum,
        **forums.someforum.get("forum_attrs"),
        post_attrs=forums.someforum.get("post_attrs"),
    )

    process.start()

Contents of logging.cfg:

[loggers]
keys=root,launcher,forum,item,pipeline

[logger_root]
level=DEBUG
handlers=basic

[logger_launcher]
level=DEBUG
handlers=basic
qualname=launcher
propagate=0

[logger_forum]
...

[logger_item]
...

[logger_pipeline]
...

# --

[logger_scrapy]
level=DEBUG

# -- 

[handlers]
keys=basic

[formatters]
keys=basic

[handler_basic]
class=StreamHandler
level=DEBUG
formatter=basic

[formatter_basic]
format=%(asctime)s - [%(name)s] - %(levelname)s - %(message)s

Note that I want to be able to configure module-level logging verbosity outside of a Python file. That's why there's a dummy [logger_scrapy] section in logging.cfg: I read it in settings.py and pass the level through to the underlying CrawlSpider.

Contents of settings.py file:

import configparser
from pathlib import Path

# LOG_CONF points at logging.cfg in the project root; the exact path
# construction shown here is an assumption (the original snippet omits it)
LOG_CONF = Path(__file__).resolve().parent.parent / "logging.cfg"

config = configparser.RawConfigParser()
config.read(LOG_CONF)

LOG_LEVEL = config["logger_scrapy"].get("level")
LOG_FORMAT = config["formatter_basic"].get("format")

BOT_NAME = "scraper"
SPIDER_MODULES = ["scraper.spiders"]
NEWSPIDER_MODULE = "scraper.spiders"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 8
ITEM_PIPELINES = {
    "scraper.pipelines.PostPipeline": 300,
}

Function utils.get_logger():

import logging

from settings import LOG_CONF
from logging import config

def get_logger(logger_name):
    """Returns logger"""

    # LOG_CONF just points to `logging.cfg` in root directory
    config.fileConfig(LOG_CONF)
    return logging.getLogger(logger_name)

If I run main.py with my launcher logger, the logging output will contain stuff like:

2021-11-12 16:37:23,994 - [launcher] - INFO - Initializing spiders
2021-11-12 16:37:24,016 - [launcher] - INFO - Initialzing spider for someforum
2021-11-12 16:37:24,045 - [scrapy.extensions.telnet] - INFO - Telnet Password: 62df42034f3a7f09
2021-11-12 16:37:24,070 - [scaper.spiders.forums] - INFO - Creating scrape time log directory for forum: 'someforum'
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post title
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post currency
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post price
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Searching post title for tags
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post hash key

If I run main.py, but remove my launcher logger from both main.py and runner.py, the logging output will contain stuff like:

2021-11-12 16:39:37,734 - [scrapy.utils.log] - INFO - Scrapy 2.5.1 started (bot: scraper)
2021-11-12 16:39:37,735 - [scrapy.utils.log] - INFO - Versions: lxml 4.6.4.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.2 (default, Dec 21 2020, 15:06:04) - [Clang 12.0.0 (clang-1200.0.32.29)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform macOS-10.15.7-x86_64-i386-64bit
2021-11-12 16:39:37,735 - [scrapy.utils.log] - DEBUG - Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-12 16:39:37,748 - [scrapy.crawler] - INFO - Overridden settings:
{'BOT_NAME': 'scraper',
'CONCURRENT_REQUESTS': 8,
'LOG_FORMAT': '%(asctime)s - [%(name)s] - %(levelname)s - %(message)s',
'NEWSPIDER_MODULE': 'scraper.spiders',
'SPIDER_MODULES': ['scraper.spiders']}
2021-11-12 16:39:37,769 - [scrapy.extensions.telnet] - INFO - Telnet Password: 2baaa967d9d68933
2021-11-12 16:39:37,792 - [scrapy.middleware] - INFO - Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']

Scrapy logging from my call to CrawlerProcess.crawl() is once again sent to stderr, which is great. I just don't want my logs and Scrapy's logs to be mutually exclusive.

My gut tells me I'm messing up the root logger somehow, or my own loggers are overriding Scrapy loggers. I'm out of ideas on how else I can troubleshoot this, so any thoughts/suggestions are appreciated. Thank you in advance!

CodePudding user response:

I finally figured this out. TL;DR: fileConfig() disables all existing loggers by default, and that is exactly what my get_logger() function was calling every time it instantiated a logger object. Calling it as fileConfig(conf, disable_existing_loggers=False) resolves the issue, and now I can see output from all loggers.


I decided to drill down a bit further into the Python and Scrapy source code, and noticed that every logger object used by Scrapy had disabled=True, which explained why nothing was being logged from Scrapy.

The next question was "why the heck are all Scrapy loggers hanging out with disabled=True?" Google came to the rescue and led me to a thread where someone pointed out that calling fileConfig() disables all loggers that exist at the time of the call.

I had initially assumed that the disable_existing_loggers parameter defaulted to False. Per the Python docs, it actually defaults to True, so my thinking was backwards.
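
To see the behaviour in isolation, here is a minimal sketch (it assumes the logging.cfg shown above is readable from the working directory; the "scrapy.crawler" name is just a stand-in for any logger Scrapy creates at import time):

import logging
from logging import config

# Stand-in for a logger that Scrapy creates at import time,
# i.e. before fileConfig() ever runs
scrapy_logger = logging.getLogger("scrapy.crawler")
print(scrapy_logger.disabled)  # False

# Default behaviour (disable_existing_loggers=True): the pre-existing
# "scrapy.crawler" logger gets switched off
config.fileConfig("logging.cfg")
print(scrapy_logger.disabled)  # True

# With the flag set, pre-existing loggers are left enabled (and re-enabled)
config.fileConfig("logging.cfg", disable_existing_loggers=False)
print(scrapy_logger.disabled)  # False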

Now that I've updated my get_logger() function in utils.py to:

import logging 

from settings import LOG_CONF
from logging import config

def get_logger(logger_name):
    """Returns logger"""

    # Keep loggers created before this call (e.g. Scrapy's) enabled
    config.fileConfig(LOG_CONF, disable_existing_loggers=False)
    return logging.getLogger(logger_name)

Everything works like a charm.
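
As a side note on the design: because get_logger() re-runs fileConfig() on every call, another option (just a sketch of the standard logging pattern, not what the project above does) is to configure logging exactly once at startup and have each module fetch its logger with plain logging.getLogger():

# main.py (sketch): configure logging once, up front
import logging
from logging import config

# The path to logging.cfg is assumed here; keep the flag so any loggers
# already created during imports (e.g. by Scrapy) stay enabled
config.fileConfig("logging.cfg", disable_existing_loggers=False)

logger = logging.getLogger("launcher")
logger.info("Logging configured once at startup")

Either way, the key point is the disable_existing_loggers flag.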
