Why is json null?

I'm having a problem with Scrapy. I created nba.json by running this command in the terminal: scrapy crawl nba -o nba.json. But the JSON file is empty, and I don't know why. Before this, I used the same code to produce another JSON file, and it worked. Can anyone help me, please? Thanks in advance!

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]
    def parse(self, response):
        for content in response.xpath("//*[@id='fittPageContainer']/div[3]/div/div/section[1]/div/div[4]/div[1]/div/div[2]/div/div/div[2]/table/tbody/tr"):
            yield {
                "name" : content.xpath('td[1]/div/a/text()').get(),
                "team" : content.xpath('td[1]/div/span[2]/text()').get(),
                "ppg" : content.xpath('td[2]/text()').get()
            }

        next_page = response.xpath('').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

CodePudding user response：

Some of that information is rendered via JavaScript, so it is not in the raw HTML that Scrapy downloads and your XPath matches nothing.
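One way to confirm this is to count the table rows in the HTML the server actually returns and compare that with what the browser shows. The following is only a minimal sketch using the standard library; the `RowCounter` class, the `count_body_rows` helper, and the two HTML strings are hypothetical stand-ins, not ESPN's real markup.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags that appear inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_body_rows(html: str) -> int:
    """Return how many <tbody> rows the given HTML string contains."""
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical examples: what a plain download might contain versus
# what the page looks like after its JavaScript has built the table.
static_html = "<div id='fittPageContainer'><script>/* builds table */</script></div>"
rendered_html = (
    "<div id='fittPageContainer'><table><tbody>"
    "<tr><td>A</td></tr><tr><td>B</td></tr>"
    "</tbody></table></div>"
)

print(count_body_rows(static_html))    # 0
print(count_body_rows(rendered_html))  # 2
```

In practice you would pass the spider's `response.text` to `count_body_rows`; a result of 0 for a page that shows a full table in the browser suggests the rows are added client-side.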

You can use the scrapy-playwright plugin to get the rendered content.

pip install scrapy-playwright
playwright install

Then, in settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Then, in your spider, you just need to add the playwright key to the meta of each request.

For example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]

    def start_requests(self):
        # Send each start URL through Playwright so the page's JavaScript runs.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for content in response.xpath("//div[@class='mb1']"):
            # Only process the section titled "Offensive Leaders".
            if content.xpath('./div/text()').get() == "Offensive Leaders":
                for table in content.xpath('.//div[@]'):
                    for row in table.xpath('.//tbody/tr'):
                        yield {
                            "name": row.xpath('.//a/text()').getall(),
                            "team": row.xpath('.//span/text()').getall(),
                            "ppg": row.xpath('.//td[@]/text()').get(),
                        }

OUTPUT:

2022-12-07 17:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.espn.com/nba/stats/_/season/2020/seasontype/2> (referer: https://www.espn.com/) ['playwright']
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'James Harden', 'team': 'HOU', 'ppg': '34.3'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Bradley Beal', 'team': 'WSH', 'ppg': '30.5'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Damian Lillard', 'team': 'POR', 'ppg': '30.0'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Trae Young', 'team': 'ATL', 'ppg': '29.6'}

CodePudding user response：

It seems your IP has been temporarily blocked because you made multiple requests from it. Routing your requests through a proxy can get around this.
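If a block really is the cause (nothing in the question confirms it), one common approach is to rotate requests across several proxies. Scrapy's built-in HttpProxyMiddleware reads the proxy for a request from its `proxy` meta key, so the sketch below only builds those meta dicts. The `proxy_meta_cycle` helper and the proxy URLs are hypothetical; substitute endpoints from whatever proxy provider you use.

```python
from itertools import cycle


def proxy_meta_cycle(proxies):
    """Yield request-meta dicts that rotate through the given proxy URLs."""
    for proxy in cycle(proxies):
        yield {"proxy": proxy}


# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

metas = proxy_meta_cycle(PROXIES)
first, second, third = next(metas), next(metas), next(metas)

print(first)   # {'proxy': 'http://proxy1.example.com:8000'}
print(second)  # {'proxy': 'http://proxy2.example.com:8000'}
print(third)   # {'proxy': 'http://proxy1.example.com:8000'}
```

In a spider, each dict would be passed as `scrapy.Request(url, meta=next(metas))`. This applies to plain Scrapy requests; requests rendered through a browser plugin may need the proxy configured on the browser side instead.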