I have a problem with my Python script. When I run my spider directly with scrapy runspider Myspider it works, but when I run it from the main file I get this error: KeyError: 'driver'
My settings file:
SELENIUM_DRIVER_NAME = 'chrome'
#SELENIUM_DRIVER_EXECUTABLE_PATH = '/home/PATH/OF/FILE/chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
My spider file:
import scrapy
from scrapy_selenium import SeleniumRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, list_urls, *args, **kwargs):
        super(my_spider, self).__init__(*args, **kwargs)
        self.urls = list_urls

    def start_requests(self):
        for url in self.urls:
            yield SeleniumRequest(
                url=url['link'],
                callback=self.parse,
                wait_time=15,
            )
And my main file:
import scrapy
import classListUrls
from scrapy.crawler import CrawlerProcess
from dir.spiders import Spider

URL = "example.com"
urls = classListUrls.GenListUrls(URL)

process = CrawlerProcess()
process.crawl(Spider.my_spider, list_urls=urls.list_urls())
process.start()
I don't understand why I get this error.
CodePudding user response:
One problem I see is that the first argument to process.crawl should be the spider class, not the spider name:
process.crawl(Spider.MySpider, list_urls=urls.list_urls())
The same is true when you call the superclass in the spider's __init__, although the better option is to leave the arguments out entirely, since Python 3's zero-argument super() already resolves the correct class:
class MySpider(scrapy.Spider):
    def __init__(self, *args, list_urls=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls = list_urls
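As an aside, the KeyError: 'driver' itself is raised wherever a callback looks up the Selenium driver in the request meta, a key that scrapy_selenium.SeleniumMiddleware only sets while it is enabled. A minimal sketch of such a callback (the question doesn't show parse, so this is purely illustrative):

def parse(self, response):
    # set by scrapy_selenium.SeleniumMiddleware when it is enabled
    driver = response.request.meta['driver']
    self.logger.info('Page title: %s', driver.title)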
Another thing is that the CrawlerProcess needs to be constructed with a settings object, because a standalone script does not read the project's settings.py on its own. Without those settings the SeleniumMiddleware is never enabled, so the driver never reaches the request meta, and that is exactly where the KeyError: 'driver' comes from:
process = CrawlerProcess(settings={
    "SELENIUM_DRIVER_NAME": 'chrome',
    "SELENIUM_DRIVER_ARGUMENTS": ['--headless'],
    "DOWNLOADER_MIDDLEWARES": {'scrapy_selenium.SeleniumMiddleware': 800},
})