Why scrapy image pipeline is not downloading images?

Time:09-07

I am trying to download all the images from the product gallery. I tried the script below, but it does not download them. I can download the main image, which has an id, but the other gallery images have no id and I could not download them.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):

        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath("//div[@class='item']/a/img/@src").getall()
        } 

CodePudding user response:

@Raisul Islam, '//*[@id="image-main"]/@src' returns the image URL and I'm not getting any errors. Please check whether the output below is what you expected.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BasicSpider(CrawlSpider):
    name = 'basic'
    allowed_domains = ['www.leebmann24.de']
    start_urls = ['https://www.leebmann24.de/bmw.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='category-products']/ul/li/h2/a"), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):

        yield {
            'URL': response.url,
            'Price': response.xpath("normalize-space(//span[@class='price']/text())").get(),
            'image_urls': response.xpath('//*[@id="image-main"]/@src').get()
        } 

Output:

{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-3er-f30-f31.html', 'Price': '57,29\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452302924-1.jpg'}
2022-09-07 02:35:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html> (referer: https://www.leebmann24.de/bmw.html?p=2)
2022-09-07 02:35:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html>
{'URL': 'https://www.leebmann24.de/bmw-erste-hilfe-set-klarsichtbeutel-51477158344.html', 'Price': '15,64\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/b/m/bmw-erste-hilfe-klarsichtbeutel-51477158433.jpg'}
2022-09-07 02:35:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.leebmann24.de/erste-hilfe-set.html> (failed 1 times): 503 Service Unavailable
2022-09-07 02:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html> (referer: https://www.leebmann24.de/bmw.html)
2022-09-07 02:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html>
{'URL': 'https://www.leebmann24.de/aruma-antirutschmatte-x5-f15.html', 'Price': '71,66\xa0€', 'image_urls': 'https://www.leebmann24.de/media/catalog/product/cache/1/image/363x/040ec09b1e35df139433887a97daa66f/a/r/aruma-antirutschmatte-94452347734-1.jpg'}
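Note that the output above only shows the URL being scraped, not a file being saved. Neither post shows the project settings, so this is only a guess at what might be missing: a minimal sketch of the settings.py entries the built-in ImagesPipeline needs, assuming that pipeline is the one in use. The folder name `images` is a hypothetical choice, and Pillow must be installed for the pipeline to run.

```python
# settings.py (sketch; the storage folder name is an assumption)

# Enable Scrapy's built-in image pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are written; without it the pipeline stays disabled.
IMAGES_STORE = "images"
```

The pipeline also expects 'image_urls' to hold a list of URLs, so `.getall()` (which returns a list) is a safer fit for that field than `.get()` (which returns a single string).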

CodePudding user response:

This expression gets the URLs of all product images except the main one (which you said you already have):

'//div[@id="itemslider-zoom"]//a/@href'
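To feed both the main image and the gallery images to the pipeline, the two XPath results can be merged into one list. The helper below is a rough sketch, not code from either answer: `collect_image_urls` is a hypothetical name, and it assumes the extracted values may be relative or repeated, so it makes them absolute and drops duplicates.

```python
from urllib.parse import urljoin


def collect_image_urls(base_url, main_src, gallery_hrefs):
    """Merge the main image and gallery links into one list of absolute URLs.

    Hypothetical helper: keeps the first occurrence of each URL and
    preserves the original order.
    """
    candidates = ([main_src] if main_src else []) + list(gallery_hrefs)
    seen = set()
    urls = []
    for candidate in candidates:
        # Resolve relative paths against the page URL.
        absolute = urljoin(base_url, candidate.strip())
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Possible use inside parse_item (untested against the live site):
# 'image_urls': collect_image_urls(
#     response.url,
#     response.xpath('//*[@id="image-main"]/@src').get(),
#     response.xpath('//div[@id="itemslider-zoom"]//a/@href').getall(),
# )
```

Passing `.get()` for the main image and `.getall()` for the gallery keeps the field a plain list, which is the shape the image pipeline reads.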