I need to crawl all the regional versions of this site (mkm-metal.ru). If I understood correctly, the geolocation is passed as the REGION_ID parameter in the URL (https://mkm-metal.ru/?REGION_ID=141) and as an ID in the cookies ({'BITRIX_SM_CITY_ID': loc_id}).
import scrapy
import re


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
But in my case, the parse_2 method returns only one city (the first ID, 142). What's wrong? Where is the error?
Here's the log:
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=142> (referer: None)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=8> (referer: None)
2022-06-05 17:32:46 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mkm-metal.ru/catalog/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/catalog/> (referer: https://mkm-metal.ru/?REGION_ID=142)
Бугульма https://mkm-metal.ru/catalog/
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=12> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=96> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] INFO: Closing spider (finished)
CodePudding user response:
In the parse method you request the same URL (https://mkm-metal.ru/catalog/) for every region. Scrapy's default dupefilter fingerprints requests by method, URL, and body; cookies are not part of the fingerprint, so only the first /catalog/ request goes through and the rest are dropped. That is the "Filtered duplicate request" line in your log.
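You can verify this yourself: two requests that differ only in their cookies produce identical fingerprints. A quick check, assuming the scrapy.utils.request.request_fingerprint helper available in the Scrapy releases current at the time of your log (newer versions expose scrapy.utils.request.fingerprint instead):

import scrapy
from scrapy.utils.request import request_fingerprint

# Same URL and method, different cookies: the fingerprints match,
# so the dupefilter drops the second request as a duplicate.
r1 = scrapy.Request('https://mkm-metal.ru/catalog/', cookies={'BITRIX_SM_CITY_ID': '142'})
r2 = scrapy.Request('https://mkm-metal.ru/catalog/', cookies={'BITRIX_SM_CITY_ID': '8'})
print(request_fingerprint(r1) == request_fingerprint(r2))  # True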
Add dont_filter=True to the /catalog/ request:
import scrapy


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
            dont_filter=True,  # same URL for every region, so skip the dupefilter
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
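The commented-out meta={'cookiejar': ...} lines point at a second pitfall worth keeping in mind: all of the requests above share Scrapy's single default cookie jar, so with concurrent downloads the responses for different regions can overwrite each other's BITRIX_SM_CITY_ID (and any session cookies the site sets). The cookiejar meta key is standard Scrapy and keeps one jar per key; here is a minimal sketch of the same spider with one jar per region (the need for this isolation is an assumption about how this Bitrix site tracks the city):

import scrapy


class MkmPerJar(scrapy.Spider):
    # Hypothetical variant of the spider above: each region gets its own
    # cookie jar, so one response cannot clobber another region's cookies.
    name = 'mkm_per_jar'

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            yield scrapy.Request(
                url=f"https://mkm-metal.ru/?REGION_ID={loc_id}",
                callback=self.parse,
                cookies={'BITRIX_SM_CITY_ID': loc_id},
                meta={'cookiejar': loc_id},  # one isolated jar per region
            )

    def parse(self, response):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            meta={'cookiejar': response.meta['cookiejar']},  # stay in the same jar
            dont_filter=True,  # same URL for every region, so skip the dupefilter
        )

    def parse_2(self, response):
        city = response.css('a.place span::text').get()
        if city:  # guard against the selector not matching
            print(city.strip(), response.url)

Because each follow-up request carries the cookiejar key forward in meta, the /catalog/ request is sent with exactly the cookies accumulated for its own region.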