How to scrape products from a finite-scrolling page using Scrapy?

I recently started learning Scrapy and decided to scrape this site.

The page initially shows 24 products, and more products load as you scroll down.

There should be about 334 products on this page.

I used Scrapy to scrape the products and the information inside them, but I can't get it to scrape more than 24 products.

I think I need Selenium or Splash to render the page and scroll down to the end before I can scrape everything.

This is the code that scrapes 24 products:

import scrapy


class BookSpider(scrapy.Spider):
    name = 'basics2'
    # Not used by the callbacks below.
    api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
    start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']

    # custom_settings has to be a class attribute for Scrapy to apply it.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }

    # XPaths that locate the product links on the listing page.
    link_xpaths = [
        "//div[@class='product-grid-product-info__main-info']//a",
        "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a",
        "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a",
        "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a",
        "//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a",
        "//ul[@class='product-grid-product-info__main-info']//a",
    ]

    def parse(self, response):
        """Follow the href of every product found on the listing page."""
        for xpath in self.link_xpaths:
            for link in response.xpath(xpath):
                yield response.follow(link, callback=self.parse_book)

    def parse_book(self, response):
        """Collect the information from a single product page."""
        yield {
            'title': response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
            'normal_price': response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
            # text() added so the price text is returned rather than the span's HTML.
            'discounted_price': response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]/text()").get(),
            'Reference': response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
            'Description': response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
            'Image': response.xpath("//picture[@class='media-image']//source//@srcset").getall(),
            'item_url': response.url,
            # 'User-Agent': response.request.headers['User-Agent']
        }

Answer:

There is no need for Selenium, which is slow and complex for this task. You can get all the required data directly from the API, like this:

import scrapy
import json
 
API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
        
    custom_settings = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
 
    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            yield {
                "name": data.get("commercialComponents")[0]['name'],
            }
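
The spider above reads only the first entry of productGroups. As a hedged sketch, the same lookup can be written as a plain function that walks every group and skips entries that lack the expected keys. The function name iter_product_names and the sample payload are hypothetical; only the key names productGroups, elements, commercialComponents and name come from the code above, and the real response may be shaped differently.

```python
def iter_product_names(payload):
    """Yield product names from a decoded category payload.

    Assumes the key layout used in the spider above; anything
    missing or empty is skipped instead of raising.
    """
    for group in payload.get("productGroups") or []:
        for element in group.get("elements") or []:
            components = element.get("commercialComponents") or []
            if not components:
                continue
            name = components[0].get("name")
            if name:
                yield name


# Hypothetical sample payload, not real API data.
sample = {
    "productGroups": [
        {"elements": [
            {"commercialComponents": [{"name": "TOP"}]},
            {"commercialComponents": []},
            {},
        ]},
        {"elements": [{"commercialComponents": [{"name": "SKIRT"}]}]},
    ]
}

names = list(iter_product_names(sample))
print(names)
```

Inside a spider, parse could loop over iter_product_names(json.loads(response.text)) and yield one item per name. The log that follows comes from the answer's original spider, not from this sketch.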

Output:

{'name': 'БОТИЛЬОНЫ ИЗ ТКАНИ С ОТДЕЛКОЙ ПАЙЕТКАМИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТУФЛИ С ОТДЕЛКОЙ ПАЙЕТКАМИ, НА КАБЛУКЕ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ФУТБОЛКА С ВОРОТНИКОМ-СТОЙКОЙ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'СУМКА-ШОПЕР С УЗЛАМИ НА ЛЯМКАХ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНОЕ ПЛАТЬЕ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНЫЕ ЛЕГИНСЫ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 22:39:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 186484,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 3.171018,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 19, 16, 39, 52, 441260),
 'httpcompression/response_bytes': 2096267,
 'httpcompression/response_count': 1,
 'item_scraped_count': 476,
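
The run reports 476 items, while the question expected about 334, and the sample log repeats some names. A minimal sketch for checking how many distinct names a list of scraped items contains, assuming items shaped like the {'name': ...} dictionaries in the log. The helper name summarize_names is hypothetical, and repeated names may be legitimate variants (for example, different colors) rather than duplicates; the source does not say which.

```python
from collections import Counter


def summarize_names(items):
    """Return the number of distinct names and the names seen more than once."""
    counts = Counter(item["name"] for item in items)
    repeated = {name: n for name, n in counts.items() if n > 1}
    return len(counts), repeated


# Three items copied from the log above.
items = [
    {"name": "МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ"},
    {"name": "МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ"},
    {"name": "БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ"},
]

distinct, repeated = summarize_names(items)
print(distinct, repeated)
```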

Update: here is how to extract the image URL from this website's API response.

import scrapy
import json
 
API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
        
    custom_settings = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
 
    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            # The first media entry of the first commercial component holds the image parts.
            media = data.get("commercialComponents")[0]['xmedia'][0]
            name = media['name']
            path = media['path']
            ts = media['timestamp']
            # Join base URL, path, file name and timestamp into the image URL.
            img = f"https://static.zara.net/photos//{path}/{name}.jpg?ts={ts}"

            yield {
                "image_url": img,
            }

Output:

{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_2_2_1.jpg?ts=1668003224849'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_1_1_1.jpg?ts=1668003224932'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/744/505/2/1067744505_1_1_1.jpg?ts=1668155524538'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_1_1.jpg?ts=1668085284347'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8587/866/099/2/8587866099_1_1_1.jpg?ts=1668003219701'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_10_1.jpg?ts=1668081955599'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/5388/629/711/2/5388629711_1_1_1.jpg?ts=1668008862794'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/800/2/6672010800_1_1_1.jpg?ts=1668172065554'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/002/2/6672010002_2_3_1.jpg?ts=1668164312812'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_8_1.jpg?ts=1668696590284'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/938/822/2/7901938822_2_5_1.jpg?ts=1668767172364'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/935/822/2/7901935822_2_5_1.jpg?ts=1668764555064'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_1_1.jpg?ts=1668691124206'} 
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/936/822/2/7901936822_2_5_1.jpg?ts=1668767061454'} 
2022-11-20 23:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 23:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 186815,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.670308,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 20, 17, 16, 14, 180866),
 'httpcompression/response_bytes': 2100146,
 'httpcompression/response_count': 1,
 'item_scraped_count': 474,

... and so on.
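
The URL assembly from the update can also be isolated in a small helper, so it can be checked without running the spider. This is only a sketch: build_image_url is a hypothetical name, and the sample values are read off the first image URL in the log above, assuming the path field begins with a slash and the timestamp arrives as a string or a number.

```python
def build_image_url(path, name, timestamp):
    """Join the xmedia parts into an image URL, using the base URL from the update."""
    return f"https://static.zara.net/photos//{path}/{name}.jpg?ts={timestamp}"


# Values taken from the first logged image URL.
url = build_image_url(
    "/2022/I/0/1/p/1067/785/800/2",
    "1067785800_2_2_1",
    "1668003224849",
)
print(url)
```

Using an f-string avoids a TypeError if the timestamp turns out to be an integer rather than a string.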