I've got a little problem with my Scrapy spider. I set up Scrapy and everything works fine, but every time I want to scrape a website I have to start the spider myself. I want it to be fully automated, but I don't know how to do that.
Currently I start the spider with cmdline.execute. I thought I could simply wrap it in a while True loop, but it turns out that doesn't work, and I found out that the spider doesn't really quit. It's hard to explain: PyCharm says "Finished with exit code 0", but if I put print("End of program") after the cmdline.execute call, it doesn't print anything.
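To show what I mean, here is a stripped-down stand-in; using sys.exit in place of the real crawl is just my guess at what happens internally, but it reproduces the symptom exactly:

```python
import sys

def run_spider():
    # hypothetical stand-in for cmdline.execute(["scrapy", "crawl", "myspider"]);
    # my guess is the real call ends the interpreter the same way
    sys.exit(0)

try:
    run_spider()
    print("End of program")  # with the real cmdline.execute this never prints
except SystemExit:
    print("the call raised SystemExit instead of returning")
```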
At this point I'm confused about what to do. Can you help me?
CodePudding user response:
There are many options for scheduling spiders.
CRON: As Alexander commented, you can create a cron job. I think this is best suited for situations where you have just a few spiders and won't change the schedule often.
Scrapydweb: A web interface for managing Scrapyd. You have to host it yourself, but it's quite easy to use in my experience.
Zyte: Practically the same as Scrapydweb, but it's a SaaS app that you do not host yourself. Very easy to use, but expensive.
Gerapy: I have not tried it, but I believe it's similar to Scrapydweb while built on more modern frameworks.
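For the CRON option, a crontab entry could look like the one below; the project path, spider name, and log path are all placeholders you'd replace with your own:

```shell
# run the spider every day at 03:00 (edit with `crontab -e`)
0 3 * * * cd /home/user/myproject && scrapy crawl myspider >> /var/log/myspider.log 2>&1
```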
CodePudding user response:
Try using scrapyd.
Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.
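As a sketch of that JSON API, scheduling a run is a POST to the schedule.json endpoint with the project and spider names as form fields ("myproject" and "myspider" below are placeholders, and the default port is 6800):

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# build the form-encoded body that Scrapyd's schedule.json endpoint expects
data = urlencode({"project": "myproject", "spider": "myspider"}).encode()
req = Request("http://localhost:6800/schedule.json", data=data)
# urlopen(req) would actually schedule the run; it needs a local scrapyd running
```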
Some tutorials: