With the following code, the values themselves are all retrieved correctly. However, a portion of the retrieved values (conversion_date) is not processed by the ItemLoader and is output as-is.
After various checks, it appears that the values obtained in parse_firstpage_item are not passed through the ItemLoader processors, while every field retrieved in parse_productpage_item is processed properly.
I have verified that the processor definitions in the ItemLoader are correct, because the output comes out in the desired form whenever the values do reach the ItemLoader.
Therefore, I assume there is a problem with the spider itself.
I am a beginner, so it is really difficult for me to understand how data is processed in Scrapy...
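My understanding is that an ItemLoader runs each field's input processor when a value is added and the output processor once load_item() is called. A minimal sketch of that flow (DemoItem here is a made-up example, unrelated to my code below):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class DemoItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),  # applied to each value as it is added
        output_processor=TakeFirst(),           # applied when load_item() is called
    )

loader = ItemLoader(item=DemoItem())
loader.add_value('title', '  hello  ')
print(loader.load_item())  # {'title': 'hello'}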
import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

from buyma_researchtool.items import BuymaResearchtoolItem  # items module path assumed from the project layout

class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy'
    allowed_domains = ['www.buyma.com']

    # Read from the shopper URL list saved in a CSV file
    def start_requests(self):
        with open('/Users/morni/BUYMA/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 3):  # 300
                    url = str((row[2])[:-5] + '/sales_' + str(n) + '.html')
                    # f'{self.base_page}{row}/sales_{n}.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        # errback=self.errback_httpbin,
                        dont_filter=True
                    )
    # Obtain the 30 product links on the order-history page and the conversion date (obtained here, since it is not listed on the individual product pages), store these two pieces of information in the item, and pass the request on to the next parse method.
    def parse_firstpage_item(self, response):
        conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_url = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()
        for i in range(2):
            loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
            loader.add_value("conversion_date", conversion_date[i])
            loader.add_value("product_url", product_url[i])
            item = loader.load_item()
            yield scrapy.Request(
                url=response.urljoin(item["product_url"][-1]),
                callback=self.parse_productpage_item,
                cb_kwargs={'item': item},
            )
    # Retrieve the content of the product detail page and merge it with the previously retrieved information
    def parse_productpage_item(self, response, item):
        loader = ItemLoader(item=item, response=response)
        loader.add_xpath("product_name", 'normalize-space(//li[@]/span[1]/text())')
        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        # 〜〜 (remaining fields omitted)
        yield loader.load_item()
The ItemLoader processors and item definitions are as follows:
import datetime

import scrapy
from itemloaders.processors import MapCompose, TakeFirst

def strip_n(element):
    if element:
        return element.replace('\t', '').replace('\n', '')
    return element

def conversion_dates(element):
    if element:
        str = element.replace('成約日:', '')
        dte = datetime.datetime.strptime(str, '%Y/%m/%d')
        return dte
    return element
class BuymaResearchtoolItem(scrapy.Item):
    # first_page
    conversion_date = scrapy.Field(
        input_processors = MapCompose(conversion_dates),
        output_processors = TakeFirst()
    )
    product_url = scrapy.Field(
        output_processors = TakeFirst()
    )
    # product_page
    product_name = scrapy.Field(
        input_processors = MapCompose(strip_n),
        output_processors = TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processors = MapCompose(strip_n),
        output_processors = TakeFirst()
    )
CodePudding user response:
There are a few issues I noticed. The first is that your indentation is way off and is guaranteed to throw an error; I will assume that is just a copy-and-paste issue, though. You are also using str as a variable name when it is also a type name, in your conversion_dates function. And the last thing is that you are using the incorrect keyword arguments in your Item class for each of the fields: they should be the singular input_processor and output_processor rather than input_processors and output_processors.
Fixing these minor problems should make your spider run as expected.
For example:
def conversion_dates(element):
    if element:
        s = element.replace('成約日:', '')
        element = datetime.datetime.strptime(s, '%Y/%m/%d')
    return element

class BuymaResearchtoolItem(scrapy.Item):
    conversion_date = scrapy.Field(
        input_processor = MapCompose(conversion_dates),
        output_processor = TakeFirst()
    )
    product_url = scrapy.Field(
        output_processor = TakeFirst()
    )
    product_name = scrapy.Field(
        input_processor = MapCompose(strip_n),
        output_processor = TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processor = MapCompose(strip_n),
        output_processor = TakeFirst()
    )
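With the singular keyword names in place, the loader will actually run these processors. As a quick standalone sanity check of the fixed function (the date string here is made up):

print(conversion_dates('成約日:2022/01/15'))
# -> datetime.datetime(2022, 1, 15, 0, 0)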