Scrapy: passing parameters to cookies


I need to go through all the locations (cities) of the site mkm-metal.ru. If I understood correctly, the geolocation is passed via the REGION_ID parameter in the URL (https://mkm-metal.ru/?REGION_ID=141) and via an ID in the cookies ('BITRIX_SM_CITY_ID': loc_id).
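To double-check that assumption outside Scrapy, here is a minimal sketch using requests and parsel (the library choice and the two test IDs are illustrative only; the a.place span selector is the same one the spider below relies on):

import requests
from parsel import Selector

# Quick manual check: does the BITRIX_SM_CITY_ID cookie alone switch the city?
for loc_id in ['142', '8']:
    resp = requests.get(
        'https://mkm-metal.ru/catalog/',
        cookies={'BITRIX_SM_CITY_ID': loc_id},
    )
    city = Selector(text=resp.text).css('a.place span::text').get()
    print(loc_id, (city or '').strip())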

import scrapy


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)

But in my case the parse_2 method returns only one city (the first ID, 142). What's wrong? Where is the error?

Here's the log:

2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=142> (referer: None)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=8> (referer: None)
2022-06-05 17:32:46 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mkm-metal.ru/catalog/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/catalog/> (referer: https://mkm-metal.ru/?REGION_ID=142)
Бугульма https://mkm-metal.ru/catalog/
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=12> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=96> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] INFO: Closing spider (finished)
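The dupefilters line mentions DUPEFILTER_DEBUG; enabling that standard Scrapy setting in settings.py should log every filtered duplicate instead of only the first one:

# settings.py -- log each request dropped by the duplicate filter
DUPEFILTER_DEBUG = True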

CodePudding user response:

In the parse method you request the same URL (https://mkm-metal.ru/catalog/) for every cookie set. Scrapy filters duplicate requests by default, so only the first request is downloaded and the rest are dropped (that is the dupefilters line in your log). Add dont_filter=True to that request:

import scrapy


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
            dont_filter=True,  # the catalog URL is the same for every region, so skip the dupefilter
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
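
The commented-out meta={'cookiejar': ...} lines point at an alternative worth noting: Scrapy's CookiesMiddleware supports keeping a separate cookie session per cookiejar meta key, so each region's cookie travels with its own jar instead of being re-passed by hand. A sketch of that variant (same selector assumed; dont_filter=True is still required because the catalog URL is identical across jars):

import scrapy


class MkmJars(scrapy.Spider):
    # Variant of the spider above, using one cookie jar per region
    name = 'mkm_jars'

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            yield scrapy.Request(
                url=f"https://mkm-metal.ru/?REGION_ID={loc_id}",
                callback=self.parse,
                cookies={'BITRIX_SM_CITY_ID': loc_id},
                # each region gets its own isolated cookie session
                meta={'cookiejar': loc_id},
            )

    def parse(self, response):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # reuse the same jar so the region cookie is sent automatically
            meta={'cookiejar': response.meta['cookiejar']},
            dont_filter=True,  # same URL for every jar
        )

    def parse_2(self, response):
        city = response.css('a.place span::text').get()
        print((city or '').strip(), response.url)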