I'm having a problem with Scrapy. I created nba.json by running this command in the terminal: scrapy crawl nba -o nba.json. But the JSON file is empty, and I don't know why. I previously used the same code to produce another JSON file, and it worked. Can anyone help me, please? Thanks in advance!
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]

    def parse(self, response):
        for content in response.xpath("//*[@id='fittPageContainer']/div[3]/div/div/section[1]/div/div[4]/div[1]/div/div[2]/div/div/div[2]/table/tbody/tr"):
            yield {
                "name" : content.xpath('td[1]/div/a/text()').get(),
                "team" : content.xpath('td[1]/div/span[2]/text()').get(),
                "ppg" : content.xpath('td[2]/text()').get()
            }
        next_page = response.xpath('').get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)
CodePudding user response:
Some of that information is rendered via JavaScript, so it is not present in the HTML that Scrapy downloads.
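One rough way to check this yourself is to look for the markup your XPath relies on in the raw, unrendered HTML. The sketch below is only an illustration: the helper name appears_in_raw_html and the sample page are made up for this example and are not part of Scrapy or of the ESPN site.

```python
def appears_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if expected_text occurs in the HTML exactly as downloaded."""
    # Case-insensitive substring check; no JavaScript is executed here.
    return expected_text.casefold() in raw_html.casefold()


# Hypothetical miniature of a page whose table is filled in by a script.
sample_html = """
<html><body>
  <div id="fittPageContainer"></div>
  <script>/* builds the stats table in the browser */</script>
</body></html>
"""

# The container is in the raw HTML, but the table rows are not.
print(appears_in_raw_html(sample_html, "fittPageContainer"))
print(appears_in_raw_html(sample_html, "<tbody"))
```

Inside a spider you could pass response.text as raw_html. If the markup you are selecting is missing there, the XPath has nothing to match and the output file stays empty.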
You can use the scrapy-playwright plugin to get the rendered content. Install it with:
pip install scrapy-playwright
playwright install
Then, in settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Then, in your spider, you just need to add the playwright key to the meta of each request.
For example:
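import scrapy

class QuotesSpider(scrapy.Spider):
    name = "nba"
    start_urls = ["https://www.espn.com/nba/stats/_/season/2020/seasontype/2"]

    def start_requests(self):
        # Route every start URL through Playwright so the page is rendered first.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for content in response.xpath("//div[@class='mb1']"):
            # Only the "Offensive Leaders" block holds the points-per-game table.
            if content.xpath('./div/text()').get() == "Offensive Leaders":
                # NOTE: the attribute predicate inside [@] was lost from the original post.
                for table in content.xpath('.//div[@]'):
                    for row in table.xpath('.//tbody/tr'):
                        yield {
                            "name": row.xpath('.//a/text()').getall(),
                            "team": row.xpath('.//span/text()').getall(),
                            # NOTE: the attribute predicate inside [@] was lost here as well.
                            "ppg": row.xpath('.//td[@]/text()').get()
                        }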
OUTPUT:
2022-12-07 17:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.espn.com/nba/stats/_/season/2020/seasontype/2> (referer: https://www.espn.com/) ['playwright']
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'James Harden', 'team': 'HOU', 'ppg': '34.3'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Bradley Beal', 'team': 'WSH', 'ppg': '30.5'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Damian Lillard', 'team': 'POR', 'ppg': '30.0'}
2022-12-07 17:04:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espn.com/nba/stats/_/season/2020/seasontype/2>
{'name': 'Trae Young', 'team': 'ATL', 'ppg': '29.6'}
CodePudding user response:
It seems your IP has been temporarily blocked because you made multiple requests from the same address. Routing your requests through a proxy can get around this.
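As a minimal sketch of the proxy idea, assuming Scrapy's built-in HttpProxyMiddleware (which reads the proxy from request.meta['proxy']): the helper below pairs each URL with the next proxy in a simple rotation. The proxy addresses and the name proxy_meta_for are hypothetical and only illustrate the shape of the meta dict.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with proxies you actually have access to.
PROXIES = ("http://proxy-a.example:8080", "http://proxy-b.example:8080")


def proxy_meta_for(urls, proxies=PROXIES):
    """Pair each URL with a meta dict naming the next proxy in the rotation."""
    rotation = cycle(proxies)
    return [(url, {"proxy": next(rotation)}) for url in urls]


pairs = proxy_meta_for([
    "https://www.espn.com/nba/stats/_/season/2020/seasontype/2",
])
for url, meta in pairs:
    print(url, meta)
```

In a spider, each pair could be turned into scrapy.Request(url, meta=meta) inside start_requests. Requests that are rendered through scrapy-playwright may need the proxy configured on the browser side instead, so treat this as a starting point rather than a complete fix.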