How to collect data into single item from multiple urls with scrapy python


In simpler terms: I would like to grab the return value from the callback function on each iteration of the for loop, and then yield a single item after the loop is exhausted.

What I am trying to do is the following:
I am creating new links, each of which represents a click on a tab on https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/, such as

  1. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#ah;2

  2. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#over-under;2 and so on. They are all data for the same match, so I am trying to collect the betting info into one single item.

Basically, I am using a for loop over a dict to create each new link and yielding a request with a callback function.

import re
import time
import urllib.parse
from collections import OrderedDict

import requests
import scrapy
import fake_useragent


class CountryLinksSpider(scrapy.Spider):
    name = 'country_links'
    allowed_domains = ['oddsportal.com']
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.create_all_tabs_links_from_url)

    def create_all_tabs_links_from_url(self, response):
        current_url = response.request.url
        _other_useful_scrape_data_dict = OrderedDict(
            [('time', '19:00'), ('day', '14'), ('month', 'May'), ('year', '22'), ('Country', 'Africa'),
             ('League', 'CAF Champions'), ('Home', 'ES Setif'), ('Away', 'Al Ahly'), ('FT1', '2'), ('FT2', '2'),
             ('FT', 'FT'), ('1H H', '1'), ('1H A', '1'), ('1HHA', 'D'), ('2H H', '1'), ('2H A', 1), ('2HHA', 'D')])

        with requests.Session() as s:
            s.headers = {
                "accept": "*/*",
                "accept-encoding": "gzip, deflate, br",
                "accept-language": "en-US,en;q=0.9,pl;q=0.8",
                "referer": 'https://www.oddsportal.com',
                "user-agent": fake_useragent.UserAgent().random
            }
            r = s.get(current_url)
            version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
            sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
            xeid = re.search(r'"id":"(.*?)"', r.text).group(1)

            xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))

        unix = int(round(time.time() * 1000))

        tabs_dict = {'#ah;2': ['5-2', 'AH full time', ['1', '2']], '#ah;3': ['5-3', 'AH 1st half', ['1', '2']],
                     '#ah;4': ['5-4', 'AH 2nd half', ['1', '2']], '#dnb;2': ['6-2', 'DNB full_time', ['1', '2']]}
        all_tabs_data = OrderedDict()
        all_tabs_data = all_tabs_data | _other_useful_scrape_data_dict

        for key, value in tabs_dict.items():
            api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'

            # goto each main tabs and get data from it and yield here
            single_tab_scrape_data = yield scrapy.http.Request(api_url,
                                                               callback=self.scrape_single_tab)
        # following i want to do (collect all the data from all tabs into single item)
        # all_tabs_data = all_tabs_data | single_tab_scrape_data # (as a dict)

    # yield all_tabs_data  # yield single dict with scrape data from all the tabs

    def scrape_single_tab(self, response):
        # sample scraped data from the response
        scrape_dict = OrderedDict([('AH full time -0.25 closing 2', 1.59), ('AH full time -0.25 closing 1', 2.3),
                                   ('AH full time -0.25 opening 2', 1.69), ('AH full time -0.25 opening 1', 2.12),
                                   ('AH full time -0.50 opening 1', ''), ('AH full time -0.50 opening 2', '')])

        yield scrape_dict

What I have tried: first, I tried simply returning the scraped item from the scrape_single_tab function, but I could not find a way to grab the return value of the callback function from the yielded request.

I have also tried the following libraries:

from inline_requests import inline_requests
from twisted.internet.defer import inlineCallbacks

but I cannot make them work. I feel like there must be a simpler way to append the items scraped from different links into one item and yield it.

Please help me to solve this issue.

CodePudding user response:

Technically, in scrapy we have a few approaches to transfer data between the callback functions we use to construct items from multiple requests:

1. Request meta dictionary:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        meta = {'scraped_item_data': data})

def parse_details(self, response):
    scraped_data = response.meta.get('scraped_item_data') # <- not present in Your code
    ...

You probably missed calling response.meta.get('scraped_item_data') to access the data scraped by the previous callback function.

2. cb_kwargs, available in scrapy version 1.7 and newer:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        cb_kwargs={'scraped_item_data': data})

def parse_details(self, response, scraped_item_data):  # <- already accessible data from previous request
    ...
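Applied to the tabs problem above, one option (a sketch of the cb_kwargs idea, not code from the original post) is to chain the tab requests one after another, carrying the partially built item along in cb_kwargs and yielding it once the list of tabs is exhausted. The sketch below simulates the callback chain with plain dicts and recursion so the control flow runs without a crawler; the request/response plumbing and the ODDS data are hypothetical stand-ins, and the real scrapy calls are shown in comments.

```python
# Sketch: chain "requests" sequentially, carrying the partial item in
# cb_kwargs, and emit a single merged item once every tab is consumed.
# Scrapy is simulated; in a real spider each step would be
# `yield scrapy.Request(url, callback=..., cb_kwargs=...)`.

def scrape_single_tab(fake_response):
    # Stand-in for parsing one tab's odds out of the response body.
    return {fake_response['tab']: fake_response['odds']}

def parse_tab(fake_response, item, remaining_tabs):
    """Callback for one tab: merge its data, then request the next tab,
    or yield the finished item when no tabs remain."""
    item = item | scrape_single_tab(fake_response)  # dict union, Python 3.9+
    if remaining_tabs:
        next_tab = remaining_tabs[0]
        # Real spider code would be:
        # yield scrapy.Request(api_url_for(next_tab),
        #                      callback=self.parse_tab,
        #                      cb_kwargs={'item': item,
        #                                 'remaining_tabs': remaining_tabs[1:]})
        fake_next = {'tab': next_tab, 'odds': ODDS[next_tab]}
        yield from parse_tab(fake_next, item, remaining_tabs[1:])
    else:
        yield item  # one item carrying data from every tab

ODDS = {'#ah;2': 1.59, '#ah;3': 2.30, '#dnb;2': 1.95}  # hypothetical data
tabs = list(ODDS)
first = {'tab': tabs[0], 'odds': ODDS[tabs[0]]}
items = list(parse_tab(first, {'Home': 'ES Setif', 'Away': 'Al Ahly'}, tabs[1:]))
print(items)
```

Because each tab request carries the accumulated item forward, exactly one item comes out at the end, regardless of how many tabs there are.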

3. Single item from multiple responses of the same type.
The easiest way to implement this is to assign the data to a class variable. The code will look like this:

def parse(self, response):
    self.tabs_data = []
    ...
    self.tabs_number = len(tabs)  # number of tabs
    for tab in tabs:
        yield Request(...

def parse_details(self, response):
    scraped_tab_data = ...
    self.tabs_data.append(scraped_tab_data)
    if len(self.tabs_data) == self.tabs_number: # when data from all tabs collected
        # compose one big item
        ...
        yield item
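One caveat with this approach (my observation, not part of the original answer): scrapy schedules the tab requests concurrently, so the callbacks can arrive in any order, and it is the response count, not the order, that tells you when the item is complete. A minimal stand-alone illustration of that counting logic, with hypothetical tab data:

```python
class TabCollector:
    """Minimal stand-in for the spider's class-variable pattern:
    accumulate per-tab dicts and emit one merged item once the
    expected number of responses has arrived (arrival order does
    not matter)."""

    def __init__(self, expected_tabs):
        self.tabs_number = expected_tabs
        self.tabs_data = []

    def on_tab_scraped(self, tab_dict):
        self.tabs_data.append(tab_dict)
        if len(self.tabs_data) == self.tabs_number:
            merged = {}
            for d in self.tabs_data:
                merged.update(d)
            return merged  # in a spider: yield item
        return None        # still waiting for more tabs

collector = TabCollector(expected_tabs=3)
# Responses may arrive in any order:
assert collector.on_tab_scraped({'#dnb;2': 1.95}) is None
assert collector.on_tab_scraped({'#ah;3': 2.30}) is None
item = collector.on_tab_scraped({'#ah;2': 1.59})
print(item)
```

Note that if one crawl processes more than one match URL, a single shared list would mix tabs from different matches together; keying the accumulator by a match identifier would avoid that.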
