I am new to Scrapy. I want to scrape data from alibaba.com, but I'm getting None. I don't know where the problem is. Here is my code:
import scrapy


class ALibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    search_value = 'laptop'
    start_urls = [f'https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={search_value}']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def parse(self, response):
        title = response.xpath("//div[@class='list-no-v2-main__top-area']/h2/a/@href").get()
        yield {
            'title': title
        }
And this is the output I am getting:
2023-01-18 15:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=laptop> (referer: None)
2023-01-18 15:29:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=laptop>
{'title': None}
2023-01-18 15:29:33 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-18 15:29:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
I tried both XPath and CSS selectors.
CodePudding user response:
As @SuperUser told you, the spider gets None because the site uses JavaScript to render the product information. If you disable JavaScript in your browser and reload the page, you will see that the products are not displayed.
However, you can get the information from one of the <script> tags, where it is embedded as JSON:
import json

import scrapy


class AlibabaSpider(scrapy.Spider):
    name = "alibaba"
    allowed_domains = ["alibaba.com"]
    search_value = "laptop"
    start_urls = [f"https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={search_value}"]

    def parse(self, response):
        # The product data is embedded as JSON in an inline <script> tag.
        raw_data = response.xpath("//script[contains(., 'window.__page__data__config')]/text()").get()
        if raw_data is None:
            self.logger.warning("Embedded page data not found on %s", response.url)
            return
        # Strip the JavaScript around the JSON object so json.loads can parse it.
        raw_data = raw_data.replace("window.__page__data__config = ", "").replace("window.__page__data = window.__page__data__config.props", "")
        data = json.loads(raw_data)
        title = data["props"]["offerResultData"]["offerList"][0]["information"]["puretitle"]
        yield {"title": title}  # Laptops Laptop Cheapest OEM Core I5...
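
The string replacement above only works while the script contains exactly those two statements. As a rough sketch of a more tolerant approach, the JSON object can be located and parsed with json.JSONDecoder.raw_decode, which ignores any JavaScript that follows it. The helper names extract_page_config and iter_titles and the sample string are hypothetical. The variable name and key path are taken from the code above, and the real page layout may differ or change.

```python
import json

# Variable name taken from the spider above; the site may rename it at any time.
MARKER = "window.__page__data__config"


def extract_page_config(script_text, marker=MARKER):
    """Return the JSON object assigned to `marker`, or None if it cannot be parsed."""
    start = script_text.find(marker)
    if start == -1:
        return None
    # The JSON object begins at the first "{" after the marker.
    brace = script_text.find("{", start)
    if brace == -1:
        return None
    try:
        # raw_decode parses a single JSON value starting at `brace` and
        # ignores whatever JavaScript comes after it.
        data, _end = json.JSONDecoder().raw_decode(script_text, brace)
    except json.JSONDecodeError:
        return None
    return data


def iter_titles(data):
    """Yield the title of every offer, using the key path from the spider above."""
    offers = data.get("props", {}).get("offerResultData", {}).get("offerList", [])
    for offer in offers:
        title = offer.get("information", {}).get("puretitle")
        if title:
            yield title


# Hypothetical sample standing in for the real <script> contents.
sample = (
    'window.__page__data__config = {"props": {"offerResultData": {"offerList": ['
    '{"information": {"puretitle": "Laptop A"}}, '
    '{"information": {"puretitle": "Laptop B"}}]}}}\n'
    "window.__page__data = window.__page__data__config.props"
)
titles = list(iter_titles(extract_page_config(sample)))
print(titles)  # ['Laptop A', 'Laptop B']
```

Inside parse, the script text returned by response.xpath(...).get() could be passed to extract_page_config, and each title from iter_titles yielded as its own item, so the spider returns every product on the page instead of only the first one.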