Remove duplicate value using scrapy

Time:07-25

The page lists 695 records, but my spider returns 954, so the output contains duplicates. How can I remove the duplicates so that I get only the 695 records? This is the page link: http://www.palatakd.ru/list/

import scrapy
from scrapy.http import Request

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number = 1

    def parse(self, response):
        details=response.xpath("//p[@class='detail_block']")
        for detail in details:
            registration=detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address=detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone=detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax=detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield{
                'Телефон':phone,
                'Факс':fax,
                'Регистрационный номер адвоката в реестре':registration,
                'Адрес':address
            
            }
            next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)
            
            if PushpaSpider.page_number<=3:
                PushpaSpider.page_number += 1
                yield response.follow(next_page, callback = self.parse)

Answer:

You can enable your item pipeline to filter out duplicates.

For example:

In your settings.py file, turn on (uncomment) the ITEM_PIPELINES setting:

ITEM_PIPELINES = {
   'project.pipelines.ProjectPipeline': 300,
}

Then, in your pipelines.py file, filter out the duplicate items:

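from scrapy.exceptions import DropItem

class ProjectPipeline:
    # Items that have already passed through the pipeline during this crawl
    itemlist = []

    def process_item(self, item, spider):
        # Drop any item identical to one that was seen before
        if item in self.itemlist:
            raise DropItem(f"Duplicate item: {item!r}")
        self.itemlist.append(item)
        return item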

No adjustments need to be made to your spider.
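
If the crawl grows large, checking membership in a list gets slow, because every new item is compared against all earlier ones. A possible variation, sketched here under some assumptions, keeps a set of hashable keys instead. The class name DedupPipeline and the attribute seen are hypothetical, and the key assumes every field value is hashable (strings or None, as in the spider above). The try/except around the import is only there so the sketch can be run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


class DedupPipeline:
    """Hypothetical pipeline: drops items whose fields all match an earlier item."""

    def open_spider(self, spider):
        # Start each crawl with an empty set of seen keys.
        self.seen = set()

    def process_item(self, item, spider):
        # Dicts are not hashable, so build a hashable key from the
        # field/value pairs, sorted by field name so order does not matter.
        key = tuple(sorted(dict(item).items()))
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key!r}")
        self.seen.add(key)
        return item
```

To try it, point the ITEM_PIPELINES entry at this class instead of ProjectPipeline. If one field identifies a record on its own (the registration number might, though that is an assumption about the data), that single value could be used as the key instead of the whole item.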
