I'm trying to scrape data from a tournaments site.
Each tournament has some information such as the venue, the date, prices etc. And also the rank of teams that took part. The rank is a table that simply provides the name of the team, and its position in the rank.
Then, you can click on the name of the team which takes you to a page were we can get the roster of players that the team selected for that tournament.
I'd like to scrape the data into something like:
[{
"name": "Grand Tournament",
"venue": "...",
"date": "...",
"rank": [
{"team_name": "Team name",
"rank": 1,
"roster": ["player1", "player2", "..."]
},
{"team_name": "Team name",
"rank": 2,
"roster": ["player1", "player2", "..."]
}
]
}]
I have the following spider to scrape a single tournament page (usage: scrapy crawl tournamentspider -a strat_url="<tournamenturl>"
)
class TournamentSpider(scrapy.Spider):
name = "tournamentspider"
allowed_domains = ["..."]
def start_requests(self):
try:
yield scrapy.Request(url=self.start_url, callback=self.parse)
except AttributeError:
raise ValueError("You must use this spider with argument start_url.")
def parse(self, response):
tournament_item = TournamentItem()
tournament_item['teams'] = []
tournament_item ['name'] = "Tournament Name"
tournament_item['date'] = "Date"
tournament_item['venue'] = "Venue"
ladder = response.css('#ladder')
for row in ladder.css('table tbody tr'):
row_cells = row.xpath('td')
participation_item = PlayerParticipationItem()
participation_item['team_name'] = "Team Name"
participation_item['rank'] = "x"
# Parse roster
roster_url_page = row_cells[2].xpath('a/@href').get()
# Follow link to extract list
base_url = urlparse(response.url)
absolute_url = f'{base_url.scheme}://{base_url.hostname}/{list_url_page}'
request = scrapy.Request(absolute_url, callback=self.parse_roster_page)
request.meta['participation_item'] = participation_item
yield request
# Include participation item in the roster
tournament_item['players'].append(participation_item)
yield tournament_item
def parse_roster_page(self, response):
participation_item = response.meta['participation_item']
participation_item['roster'] = ["Player1", "Player2", "..."]
return participation_item
My problem is that this spider produces the following output:
[{
"name": "Grand Tournament",
"venue": "...",
"date": "...",
"rank": [
{"team_name": "Team name",
"rank": 1,
},
{"team_name": "Team name",
"rank": 2,
}
]
},
{"team_name": "Team name",
"rank": 1,
"roster": ["player1", "player2", "..."]
},
{"team_name": "Team name",
"rank": 2,
"roster": ["player1", "player2", "..."]
}]
I know that those extra items in the output are generated by the yield request
line. When I remove it, I'm no longer scrapping the roster page, so the extra items disappear, but I no longer have the roster data.
Is is possible to get the output I'm aiming for?
I know that a different approach could be to scrape the tournament information, and then teams with a field that identifies the tournament. But I'd like to know if the initial approach is achievable.
CodePudding user response:
you can use scrapy inline requests to to call parse_roster_page
and you'll get the roster data without yielding it out.
The only change you need to include is the decorator @inline_requests
with the function parse_roster_page
.
from inline_requests import inline_requests
class TournamentSpider(scrapy.Spider):
def parse(self, response):
...
@inline_requests
def parse_roster_page(self, response):
...