I have scraped two URLs with the same spider, as follows:
def start_requests(self):
    # Dawn categories
    yield Request('https://www.dawn.com/business', callback=self.parseDawn,
                  meta={'category': 'business', 'source': 'DAWN'})
    yield Request('https://www.dawn.com/sport', callback=self.parseDawn,
                  meta={'category': 'sports', 'source': 'DAWN'})
where self.parseDawn extracts the news items from those pages:
def parseDawn(self, response):
    items = WebscrapingItem()
    title = response.css("h2.story__title a.story__link::text").extract_first().strip()
    author = response.css("span.story__byline a.story__byline__link::text").extract_first()
    category = response.meta['category']
    items['title'] = title
    items['author'] = author
    items['category'] = category
    yield items
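For reference, the item class used above would be declared in items.py roughly like this. This is only a sketch inferred from the fields the parse method assigns; your actual WebscrapingItem may contain more fields:

import scrapy

class WebscrapingItem(scrapy.Item):
    # fields populated by parseDawn
    title = scrapy.Field()
    author = scrapy.Field()
    category = scrapy.Field()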
Now, in my pipelines.py file, I want to filter the scraped news items with category == 'business' and those with category == 'sports' into two separate dictionaries, so that each group can be saved separately in my database. Is there a way of doing this?
CodePudding user response:
You can do that easily in your pipeline:
class BotPipeline:
    def process_item(self, item, spider):
        if item['category'] == 'business':
            # run the database insert for business items here
            return item
        if item['category'] == 'sports':
            # run the database insert for sports items here
            return item
        # pass anything else through so it isn't silently dropped
        return item
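If you really want the two groups collected into separate dictionaries first (as described in the question) and written out in one go, you can accumulate them in the pipeline and flush them when the spider closes. Below is a minimal sketch of that approach; the class name CategorySplitPipeline, the in-memory dictionaries keyed by title, and the placement of the database writes are all assumptions, not part of the original code:

from scrapy.exceptions import DropItem

class CategorySplitPipeline:
    def open_spider(self, spider):
        # one dictionary per category, keyed by article title (assumed unique)
        self.business_items = {}
        self.sports_items = {}

    def process_item(self, item, spider):
        if item['category'] == 'business':
            self.business_items[item['title']] = dict(item)
        elif item['category'] == 'sports':
            self.sports_items[item['title']] = dict(item)
        else:
            # discard anything outside the two expected categories
            raise DropItem(f"Unexpected category: {item['category']}")
        return item

    def close_spider(self, spider):
        # write self.business_items and self.sports_items to their
        # respective database tables here
        pass

Whichever variant you use, remember to enable the pipeline in settings.py, e.g. (replace the module path with your project's actual path):

ITEM_PIPELINES = {
    'myproject.pipelines.BotPipeline': 300,
}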