Use scrapy to collect information for one item from multiple pages (and output it as a nested dictionary)


I'm trying to scrape data from a tournaments site.

Each tournament has some information such as the venue, the date, the prices, etc., as well as the rank of the teams that took part. The rank is a table that simply gives each team's name and its position in the ranking.

Then, you can click on a team's name, which takes you to a page where you can get the roster of players that the team selected for that tournament.

I'd like to scrape the data into something like:

[{
  "name": "Grand Tournament",
  "venue": "...",
  "date": "...",
  "rank": [
    {"team_name": "Team name",
     "rank": 1,
     "roster": ["player1", "player2", "..."]
    },
    {"team_name": "Team name",
     "rank": 2,
     "roster": ["player1", "player2", "..."]
    }
  ]
}]

I have the following spider to scrape a single tournament page (usage: scrapy crawl tournamentspider -a start_url="<tournamenturl>"):

import scrapy
from urllib.parse import urlparse


class TournamentSpider(scrapy.Spider):
    name = "tournamentspider"
    allowed_domains = ["..."]

    def start_requests(self):
        try:
            yield scrapy.Request(url=self.start_url, callback=self.parse)
        except AttributeError:
            raise ValueError("You must use this spider with argument start_url.")

    def parse(self, response):
        tournament_item = TournamentItem()
        tournament_item['rank'] = []

        tournament_item['name'] = "Tournament Name"
        tournament_item['date'] = "Date"
        tournament_item['venue'] = "Venue"

        ladder = response.css('#ladder')
        
        for row in ladder.css('table tbody tr'):
            row_cells = row.xpath('td')

            participation_item = PlayerParticipationItem()
            participation_item['team_name'] = "Team Name"             
            participation_item['rank'] = "x"

            # Parse roster
            roster_url_page = row_cells[2].xpath('a/@href').get()

            # Follow link to extract list
            base_url = urlparse(response.url)
            absolute_url = f'{base_url.scheme}://{base_url.hostname}/{list_url_page}'

            request = scrapy.Request(absolute_url, callback=self.parse_roster_page)
            request.meta['participation_item'] = participation_item
            yield request

            # Add the participation item to the tournament's rank list
            tournament_item['rank'].append(participation_item)

        yield tournament_item


    def parse_roster_page(self, response):
        participation_item = response.meta['participation_item']
        participation_item['roster'] = ["Player1", "Player2", "..."]
        return participation_item

My problem is that this spider produces the following output:

[{
  "name": "Grand Tournament",
  "venue": "...",
  "date": "...",
  "rank": [
    {"team_name": "Team name",
     "rank": 1,
    },
    {"team_name": "Team name",
     "rank": 2,
    }
  ]
},
{"team_name": "Team name",
 "rank": 1,
 "roster": ["player1", "player2", "..."]
},
{"team_name": "Team name",
 "rank": 2,
 "roster": ["player1", "player2", "..."]
}]

I know that those extra items in the output are generated by the yield request line. When I remove it, I'm no longer scraping the roster pages, so the extra items disappear, but then I no longer have the roster data either.

Is it possible to get the output I'm aiming for?

I know that a different approach could be to scrape the tournament information and then the teams separately, with a field on each team that identifies the tournament. But I'd like to know if the initial approach is achievable.
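
For reference, a rough sketch of what that flat alternative could look like, yielding the tournament and each team as separate items linked by an identifier. The spider name, the item_type/tournament_id fields, and the way the id is derived from the URL are illustrative assumptions, not taken from the real site:

import scrapy


class FlatTournamentSpider(scrapy.Spider):
    # Hypothetical spider name, used only for this sketch
    name = "flat_tournamentspider"

    def parse(self, response):
        # Illustrative assumption: derive a tournament identifier from the URL
        tournament_id = response.url.rstrip('/').split('/')[-1]

        # Yield the tournament itself as a flat item
        yield {
            'item_type': 'tournament',
            'tournament_id': tournament_id,
            'name': "Tournament Name",
            'date': "Date",
            'venue': "Venue",
        }

        # One request per team; each team item carries the tournament_id
        for row in response.css('#ladder table tbody tr'):
            roster_url = row.xpath('td[3]/a/@href').get()
            team_item = {
                'item_type': 'team',
                'tournament_id': tournament_id,
                'team_name': "Team Name",
                'rank': "x",
            }
            yield response.follow(roster_url, callback=self.parse_roster_page,
                                  meta={'team_item': team_item})

    def parse_roster_page(self, response):
        # Complete the team item with its roster and emit it on its own
        team_item = response.meta['team_item']
        team_item['roster'] = ["Player1", "Player2", "..."]
        yield team_item

The tournament and team items can then be joined on tournament_id in a pipeline or in post-processing.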

CodePudding user response:

You can use the scrapy-inline-requests library to call parse_roster_page inline, so you get the roster data back without yielding a separate item.

The only change you need is to add the @inline_requests decorator to the parse_roster_page function.

from inline_requests import inline_requests

class TournamentSpider(scrapy.Spider):

    def parse(self, response):
      ...

    @inline_requests
    def parse_roster_page(self, response):
      ...
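
Applied to the spider above, one possible arrangement is sketched below: the decorator sits on the method that yields the inline request, so the roster response comes back in place and each participation item can be appended to the tournament instead of being yielded on its own. This is a sketch, not a definitive implementation; the selectors and values are the question's placeholders, and TournamentItem / PlayerParticipationItem are assumed to be importable from the project's items module.

import scrapy
from inline_requests import inline_requests


class TournamentSpider(scrapy.Spider):
    name = "tournamentspider"
    allowed_domains = ["..."]

    @inline_requests
    def parse(self, response):
        tournament_item = TournamentItem()
        tournament_item['rank'] = []
        tournament_item['name'] = "Tournament Name"
        tournament_item['date'] = "Date"
        tournament_item['venue'] = "Venue"

        for row in response.css('#ladder table tbody tr'):
            participation_item = PlayerParticipationItem()
            participation_item['team_name'] = "Team Name"
            participation_item['rank'] = "x"

            roster_url = row.xpath('td[3]/a/@href').get()
            # The yielded request is resolved inline: its response comes
            # back here instead of going to a separate callback
            roster_response = yield scrapy.Request(response.urljoin(roster_url))
            # Placeholder: extract the players from roster_response here
            participation_item['roster'] = ["Player1", "Player2", "..."]

            tournament_item['rank'].append(participation_item)

        # Only the fully assembled tournament item is emitted
        yield tournament_item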