I want to scrape a website which scrapes information of given webpages every 5 minutes. I implemented this by adding a sleep time of 5 minutes in between a recursive callback, like so:
def _parse(self, response):
status_loader = ItemLoader(Status())
# perform parsing
yield status_loader.load_item()
time.sleep(5)
yield scrapy.Request(response._url,callback=self._parse,dont_filter=True,meta=response.meta)
However, adding time.sleep(5) to the scraper seems to mess with the inner workings of scrapy. For some reason scrapy does send out the request, but the yield items are not (or rarely) outputted to the given output file.
I was thinking it has to do with the request prioritization of scrapy, which might prioritize sending new request over yielding the scraped items. Could the be true? I tried to edit the settings to go from the Depth-first queue to a breadth-fire queue. This did not solve the problem.
How would I go about scraping a website for a given amount, let's say 5 minutes?
CodePudding user response:
It won't work because Scrapy
is asynchronous by default.
Try to set a corn job like this instead -
import logging
import subprocess
import sys
import time
import schedule
def subprocess_cmd(command):
process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
proc_stdout = process.communicate()[0].strip()
logging.info(proc_stdout)
def cron_run_win():
# print('start scraping... ####')
logging.info('start scraping... ####')
subprocess_cmd('scrapy crawl <spider_name>')
def cron_run_linux():
# print('start scraping... ####')
logging.info('start scraping... ####')
subprocess_cmd('scrapy crawl <spider_name>')
def cron_run():
if 'win' in sys.platform:
cron_run_win()
schedule.every(5).minutes.do(cron_run_win)
elif 'linux' in sys.platform:
cron_run_linux()
schedule.every(5).minutes.do(cron_run_linux)
while True:
schedule.run_pending()
time.sleep(1)
cron_run()
This will run your desired spider every 5 mins depending on the os
you are using