Why doesn't the CrawlSpider collect links?

I am trying to run my first CrawlSpider, but the program terminates without any errors and without returning anything; it finishes with zero results. What's wrong with my code?

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com']

    rules = (
        Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = Product()
            product['model'] = response.css('h2.data__symbol::text').get()
            product['brand'] = 'Fagor'
            product['file_urls'] = [doclink]
            yield product

CodePudding user response:

The main problem is that this page uses JavaScript to add all of its elements to the HTML, and Scrapy cannot run JavaScript. If you turn off JavaScript in your browser and reload the page, you will see an empty white page. There is, however, the scrapy_selenium module, which uses Selenium to control a real web browser that can run JavaScript (at the cost of running slower).
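
You can confirm this with Scrapy's interactive shell before adding Selenium (a quick check, not part of the final spider; the a.file selector is the one from the question):

# in a terminal: download the page with plain Scrapy (no JavaScript) and inspect it
scrapy shell 'https://fagorelectrodomestico.com/en/'

# inside the shell:
>>> response.css('a.file').getall()   # expected to come back empty - the links are added by JavaScript
>>> view(response)                    # opens the downloaded HTML in a browser - a mostly blank page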

The other problem: your rule looks for links containing product/, which I don't see on the main page but do see on the category pages. You don't have a rule that loads those other pages, so the spider can't reach the product/ links on the subpages. It needs a second rule that extracts the other links and sends them to the parse callback (which loads the page, extracts all links and checks the rules against them).

You may also need to add /en/ to the start URL to get the English version, which has links containing product/. The Spanish version uses productos/ instead.
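
If you want to check the allow/deny patterns before wiring them into rules, you can run the link extractors directly against a response. Here is a minimal, self-contained sketch; the HTML and URLs are invented for illustration, and a real run would use a response rendered through Selenium:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# hand-made HTML standing in for a rendered category page (the links are made up)
html = b"""
<html><body>
  <a href="/en/product/example-oven">Example oven</a>
  <a href="/en/ovens">Ovens category</a>
</body></html>
"""
response = HtmlResponse(url='https://fagorelectrodomestico.com/en/',
                        body=html, encoding='utf-8')

product_links = LinkExtractor(allow='/en/product/')               # product pages -> parse_item
other_links = LinkExtractor(allow='/en/', deny='/en/product/')    # other pages -> parse

print([link.url for link in product_links.extract_links(response)])  # only the product/ link
print([link.url for link in other_links.extract_links(response)])    # only the category link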

Some code is needed to use SeleniumRequest instead of the standard Request: I took the relevant code from the CrawlSpider source and adapted it.

I also used CrawlerProcess to run the code without creating a project, so anyone can simply copy it and run python script.py.

It downloads the files to the folder full (inside FILES_STORE).

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy_selenium

class FagorelectrodomesticoSpider(CrawlSpider):

    name = 'fagorelectrodomestico.com'

    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com/en/']

    rules = (
        Rule(LinkExtractor(allow='/en/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow='/en/', deny='/en/product/'), callback='parse', follow=True),
    )

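    # send the start URLs through Selenium so the page's JavaScript is executed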
    def start_requests(self):
        print('[start_requests]')
        for url in self.start_urls:
            print('[start_requests] url:', url)            
            yield scrapy_selenium.SeleniumRequest(url=url, callback=self.parse)

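    # simplified re-implementation of CrawlSpider's link following,
    # yielding SeleniumRequest instead of the standard Request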
    def parse(self, response):
        print('[parse] url:', response.url)
        
        for rule_index, rule in enumerate(self._rules):
            #print(rule.callback)
            for link in rule.link_extractor.extract_links(response):
                yield scrapy_selenium.SeleniumRequest(
                    url=link.url,
                    callback=rule.callback,
                    errback=rule.errback,
                    meta=dict(rule=rule_index, link_text=link.text),
                )
            
    def parse_item(self, response):
        print('[parse_item] url:', response.url)
        
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = {
                'model': response.css('h2.data__symbol::text').get(),
                'brand': 'Fagor',
                'file_urls': [doclink],
            }
            yield product
        

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1

    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},   # used standard FilesPipeline (download to FILES_STORE/full)
    #'FILES_STORE': '/path/to/valid/dir',  # this folder has to exist before downloading
    'FILES_STORE': '.',                   # this folder has to exist before downloading

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    #'SELENIUM_DRIVER_ARGUMENTS': ['-headless'], # '--headless' if using chrome instead of firefox
    'SELENIUM_DRIVER_ARGUMENTS': [],
    #'SELENIUM_BROWSER_EXECUTABLE_PATH': '',
    #'SELENIUM_COMMAND_EXECUTOR': '',
    
    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800}
})
c.crawl(FagorelectrodomesticoSpider)
c.start()         

CodePudding user response:

From reading the docs it appears that this line could be the issue:

rules = (
    Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
)

The docs say:

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor.

Your parse_item is a callable, not a method from a spider object. So, I think you should pass it in as a callable:

rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)

Since Python evaluates the class body from top to bottom, parse_item() has to be defined above the rules line:

def parse_item(self, response):
    for doc in response.css('a.file'):
        doclink = doc.css('::attr("href")').get()
        product = Product()
        product['model'] = response.css('h2.data__symbol::text').get()
        product['brand'] = 'Fagor'
        product['file_urls'] = [doclink]
        yield product


rules = (
    Rule(LinkExtractor(allow='product/'), callback=parse_item, follow=True),
)