With the following code, the values themselves are all retrieved correctly. However, a portion of the retrieved values (conversion_date) is not processed by the ItemLoader and is output as-is.
After various checks, it appears that the values obtained in parse_firstpage_item are not passed through the ItemLoader processors, while every field retrieved in parse_productpage_item is processed properly.
I have verified that the processor definitions in the ItemLoader are correct, because the output comes out in the desired form whenever the values do reach the ItemLoader.
Therefore, I assume there is a problem with the spider itself.
I am a beginner, so it is really difficult for me to understand how data is processed in Scrapy...
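My understanding is that an ItemLoader runs each field's input processor when a value is added and the output processor once load_item() is called. A minimal sketch of that flow (DemoItem here is a made-up example, unrelated to my code below):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class DemoItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),  # applied to each value as it is added
        output_processor=TakeFirst(),           # applied when load_item() is called
    )

loader = ItemLoader(item=DemoItem())
loader.add_value('title', '  hello  ')
print(loader.load_item())  # {'title': 'hello'}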
import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

from buyma_researchtool.items import BuymaResearchtoolItem  # items module path assumed from the project layout

class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy'
    allowed_domains = ['www.buyma.com']

    # Read from the shopper URL list saved in a CSV file
    def start_requests(self):
        with open('/Users/morni/BUYMA/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 3):  # 300
                    url = str((row[2])[:-5] + '/sales_' + str(n) + '.html')
                    # f'{self.base_page}{row}/sales_{n}.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        # errback=self.errback_httpbin,
                        dont_filter=True
                    )
    # Obtain the 30 product links on the order-history page and the conversion date (obtained here, since it is not listed on the individual product pages), store these two pieces of information in the item, and pass the request on to the next parse method.
    def parse_firstpage_item(self, response):
        conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_url = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()
        for i in range(2):
            loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
            loader.add_value("conversion_date", conversion_date[i])
            loader.add_value("product_url", product_url[i])
            item = loader.load_item()
            yield scrapy.Request(
                url=response.urljoin(item["product_url"][-1]),
                callback=self.parse_productpage_item,
                cb_kwargs={'item': item},
            )
    # Retrieve the content of the product detail page and merge it with the previously retrieved information
    def parse_productpage_item(self, response, item):
        loader = ItemLoader(item=item, response=response)
        loader.add_xpath("product_name", 'normalize-space(//li[@]/span[1]/text())')
        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        # 〜〜 (remaining fields omitted)
        yield loader.load_item()
The ItemLoader processors and item definitions are as follows:
import datetime

import scrapy
from itemloaders.processors import MapCompose, TakeFirst

def strip_n(element):
    if element:
        return element.replace('\t', '').replace('\n', '')
    return element

def conversion_dates(element):
    if element:
        str = element.replace('成約日:', '')
        dte = datetime.datetime.strptime(str, '%Y/%m/%d')
        return dte
    return element
class BuymaResearchtoolItem(scrapy.Item):
    # first_page
    conversion_date = scrapy.Field(
        input_processors = MapCompose(conversion_dates),
        output_processors = TakeFirst()
    )
    product_url = scrapy.Field(
        output_processors = TakeFirst()
    )
    # product_page
    product_name = scrapy.Field(
        input_processors = MapCompose(strip_n),
        output_processors = TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processors = MapCompose(strip_n),
        output_processors = TakeFirst()
    )
CodePudding user response:
There are a few issues I noticed. The first is that your indentation is way off and is guaranteed to throw an error; I will assume that is just a copy-and-paste issue, though. You are also using str as a variable name when it is also a type name, in your conversion_dates function. And the last thing is that you are using the incorrect keyword arguments in your Item class for each of the fields: they should be the singular input_processor and output_processor rather than input_processors and output_processors.
Fixing these minor problems should make your spider run as expected.
For example:
def conversion_dates(element):
    if element:
        s = element.replace('成約日:', '')
        element = datetime.datetime.strptime(s, '%Y/%m/%d')
    return element

class BuymaResearchtoolItem(scrapy.Item):
    conversion_date = scrapy.Field(
        input_processor = MapCompose(conversion_dates),
        output_processor = TakeFirst()
    )
    product_url = scrapy.Field(
        output_processor = TakeFirst()
    )
    product_name = scrapy.Field(
        input_processor = MapCompose(strip_n),
        output_processor = TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processor = MapCompose(strip_n),
        output_processor = TakeFirst()
    )
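With the singular keyword names in place, the loader will actually run these processors. As a quick standalone sanity check of the fixed function (the date string here is made up):

print(conversion_dates('成約日:2022/01/15'))
# -> datetime.datetime(2022, 1, 15, 0, 0)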