I recently started to learn scrapy and decided to scrape this site.
There are 24 products on 1 page, and when you scroll down more products load.
There should be about 334 products on this page.
I used scrapy and tried to scrape the products and information inside, but I can't make scrapy to scrape more than 24 products.
I think, I need selenium or splash to render/scroll down to the end, and then I would be able to scrape it.
This is the code that scrapes 24 products:
import scrapy
custom_settings = {
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
}
class BookSpider(scrapy.Spider):
name = 'basics2'
api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']
#Def parse goes to the href of every product
def parse(self, response):
for link in response.xpath("//div[@class='product-grid-product-info__main-info']//a"):
yield response.follow(link, callback=self.parse_book)
for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
yield response.follow(link, callback=self.parse_book)
for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a"):
yield response.follow(link, callback=self.parse_book)
for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a"):
yield response.follow(link, callback=self.parse_book)
for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
yield response.follow(link, callback=self.parse_book)
for link in response.xpath("//ul[@class='product-grid-product-info__main-info']//a"):
yield response.follow(link, callback=self.parse_book)
#def parse-book gets all the information inside each product
def parse_book(self, response):
yield{
'title' : response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
'normal_price' : response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
'discounted_price' : response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]").get(),
'Reference' : response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
'Description' : response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
'Image' : response.xpath("//picture[@class='media-image']//source//@srcset").extract(),
'item_url' : response.url,
# 'User-Agent': response.request.headers['User-Agent']
}
CodePudding user response:
No need to use so slow and complex selenium, You can grab all the requred data from API
like:
import scrapy
import json
API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"
class TestSpider(scrapy.Spider):
name = "test"
start_urls = [API_URL]
custom_settings = {
'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
def parse(self, response):
json_response = json.loads(response.text)
datas = json_response["productGroups"][0]['elements']
for data in datas:
yield {
"name":data.get("commercialComponents")[0]['name']
}
Output:
{'name': 'БОТИЛЬОНЫ ИЗ ТКАНИ С ОТДЕЛКОЙ ПАЙЕТКАМИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТУФЛИ С ОТДЕЛКОЙ ПАЙЕТКАМИ, НА КАБЛУКЕ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ФУТБОЛКА С ВОРОТНИКОМ-СТОЙКОЙ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'СУМКА-ШОПЕР С УЗЛАМИ НА ЛЯМКАХ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНОЕ ПЛАТЬЕ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНЫЕ ЛЕГИНСЫ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 22:39:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186484,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 3.171018,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 19, 16, 39, 52, 441260),
'httpcompression/response_bytes': 2096267,
'httpcompression/response_count': 1,
'item_scraped_count': 476,
Update: See the updated answer how to extract image url from the API
responsed data of this website.
import scrapy
import json
API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"
class TestSpider(scrapy.Spider):
name = "test"
start_urls = [API_URL]
custom_settings = {
'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
def parse(self, response):
json_response = json.loads(response.text)
datas = json_response["productGroups"][0]['elements']
for data in datas:
name = data.get("commercialComponents")[0]['xmedia'][0]['name']
#print(name)
path = data.get("commercialComponents")[0]['xmedia'][0]['path']
#print(path)
ts = data.get("commercialComponents")[0]['xmedia'][0]['timestamp']
#print(ts)
img = 'https://static.zara.net/photos//' path '/' name '.jpg?ts=' ts
#print(img)
yield {
"image_url": img
}
Output:
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_2_2_1.jpg?ts=1668003224849'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_1_1_1.jpg?ts=1668003224932'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/744/505/2/1067744505_1_1_1.jpg?ts=1668155524538'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_1_1.jpg?ts=1668085284347'}2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8587/866/099/2/8587866099_1_1_1.jpg?ts=1668003219701'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_10_1.jpg?ts=1668081955599'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/5388/629/711/2/5388629711_1_1_1.jpg?ts=1668008862794'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/800/2/6672010800_1_1_1.jpg?ts=1668172065554'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/002/2/6672010002_2_3_1.jpg?ts=1668164312812'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_8_1.jpg?ts=1668696590284'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/938/822/2/7901938822_2_5_1.jpg?ts=1668767172364'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/935/822/2/7901935822_2_5_1.jpg?ts=1668764555064'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_1_1.jpg?ts=1668691124206'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/936/822/2/7901936822_2_5_1.jpg?ts=1668767061454'}
2022-11-20 23:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 23:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186815,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.670308,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 17, 16, 14, 180866),
'httpcompression/response_bytes': 2100146,
'httpcompression/response_count': 1,
'item_scraped_count': 474,
... so on