scrapy returns None

Time:01-21

I am new to Scrapy. I want to scrape data from alibaba.com, but I'm getting None. I don't know where the problem is. Here is my code:

import scrapy


class ALibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    search_value = 'laptop'
    start_urls = [f'https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={search_value}']
    
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def parse(self, response):
        title = response.xpath("//div[@class='list-no-v2-main__top-area']/h2/a/@href").get()

        yield {
            'title': title
        }

And this is the output I get:

2023-01-18 15:29:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=laptop> (referer: None)
2023-01-18 15:29:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText=laptop>
{'title': None}
2023-01-18 15:29:33 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-18 15:29:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I tried both XPath and CSS selectors, with the same result.

CodePudding user response:

As @SuperUser told you, the spider gets None because the site uses JavaScript to render the product information, so the elements your XPath targets are not in the HTML that Scrapy downloads. If you disable JavaScript in your browser and reload the page, you will see that the products are not displayed.
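One way to confirm this without a browser is to check whether the class from the XPath appears on any element in the raw HTML. The following is only a minimal sketch using the standard library; `ClassFinder`, `count_elements_with_class`, and both sample HTML strings are hypothetical stand-ins (in a spider you would pass `response.text` instead).

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.matches += 1


def count_elements_with_class(html, wanted_class):
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return finder.matches


# Hypothetical raw response: the data lives in a <script>, not in markup.
raw_html = """
<html><body>
<div id="root"></div>
<script>window.__page__data__config = {"props": {}}</script>
</body></html>
"""

# Hypothetical DOM after the browser has run the page's JavaScript.
rendered_html = """
<html><body>
<div class="list-no-v2-main__top-area"><h2><a href="/p/1">Laptop</a></h2></div>
</body></html>
"""

raw_count = count_elements_with_class(raw_html, "list-no-v2-main__top-area")
rendered_count = count_elements_with_class(rendered_html, "list-no-v2-main__top-area")
print(raw_count, rendered_count)
```

If the count for the downloaded HTML is zero while the browser shows the elements, the selector is fine and the content is being rendered client-side.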

However, you can get the information from one of the <script> tags, which contains the page data as JSON:

import scrapy
import json


class AlibabaSpider(scrapy.Spider):
    name = "alibaba"
    allowed_domains = ["alibaba.com"]
    search_value = "laptop"
    start_urls = [f"https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={search_value}"]

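    def parse(self, response):
        # The product data is embedded as JSON in an inline <script> tag.
        raw_data = response.xpath("//script[contains(., 'window.__page__data__config')]/text()").get()
        if raw_data is None:
            self.logger.warning("Embedded page data not found on %s", response.url)
            return
        # Strip the JavaScript around the JSON object so json.loads() can parse it.
        raw_data = raw_data.replace("window.__page__data__config = ", "").replace("window.__page__data = window.__page__data__config.props", "")
        data = json.loads(raw_data)

        # Title of the first product in the result list.
        title = data["props"]["offerResultData"]["offerList"][0]["information"]["puretitle"]
        yield {"title": title}  # Laptops Laptop Cheapest OEM Core I5...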
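
The chained `replace()` calls depend on the exact text around the JSON. As a possibly more tolerant alternative, the object can be decoded in place with `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it. This is only a sketch: `extract_assigned_json` is a hypothetical helper, and `sample_script` is made-up data shaped like the keys used above, not real output from the site.

```python
import json
import re


def extract_assigned_json(script_text, variable_name):
    """Return the JSON object assigned to variable_name, or None if not found."""
    # Find "<variable_name> = {" and stop right before the opening brace.
    match = re.search(re.escape(variable_name) + r"\s*=\s*(?=\{)", script_text)
    if match is None:
        return None
    try:
        # raw_decode returns (value, end_index); trailing JavaScript is ignored.
        data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return data


# Hypothetical script content, for illustration only.
sample_script = """
window.__page__data__config = {"props": {"offerResultData": {"offerList": [{"information": {"puretitle": "Example laptop"}}]}}}
window.__page__data = window.__page__data__config.props
"""

data = extract_assigned_json(sample_script, "window.__page__data__config")
title = data["props"]["offerResultData"]["offerList"][0]["information"]["puretitle"]
print(title)
```

In the spider, the text returned by the XPath query would be passed in place of `sample_script`, and a `None` result would signal that the page layout has changed.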
