Python Scrapy Stops after some Request without any error


I am trying to scrape some data with Scrapy (2.5.0) on Python (3.6.0).

Scrapy works for around 70 to 100 URLs, then it just quits with "Spider closed (finished)" and no error.

But there are more than 200K requests to make.

import scrapy
from scrapy.linkextractors import LinkExtractor
import pymongo

client = pymongo.MongoClient("mongodb+srv://<user>:<Password>@booksmotionscraper.9c8us.mongodb.net/booksmotion?retryWrites=true&w=majority")

# 'libgen2' collection in the 'libgen' database
db = client.libgen.libgen2

class lSpider(scrapy.Spider):
  name = "libgen_dlink"
  start_urls = [
    "https://booksmotion.com/main/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  ]

  def parse(self, response):
    # take the next md5 from MongoDB, queue it, then remove it
    link = db.find_one({})
    url = 'https://booksmotion.com/main/' + link['md5']
    yield scrapy.Request(url, callback=self.parse)
    db.delete_one({'_id': link['_id']})

    body = response.css('body')
    try:
      info = {
        'md5': response.url.rsplit('/', 1)[-1],
        'dlink': body.css('#download > ul > li:nth-child(2) > a').attrib['href']
      }
    except KeyError:
      info = {
        'md5': response.url.rsplit('/', 1)[-1],
        'dlink': 0
      }
    yield {
      'md5': info['md5'],
      'dlink': info['dlink']
    }

CodePudding user response:

Because of duplicates in the database, Scrapy's default duplicate filter silently skips those URLs. New URLs are only queued inside the parse callback, and parse is never called for a request that was filtered out as a duplicate, so eventually no more URLs are scheduled and Scrapy closes the spider.

Adding dont_filter=True to the request fixes the problem:

  yield scrapy.Request(url, dont_filter=True, callback=self.parse)
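
For reference, here is a sketch of how the fix sits inside the parse callback from the question, with an added guard for the case where the collection runs empty (that guard is an addition, not part of the original code):

  def parse(self, response):
    link = db.find_one({})
    if link is None:
      return  # collection is empty, nothing left to queue
    url = 'https://booksmotion.com/main/' + link['md5']
    # dont_filter=True tells Scrapy's duplicate filter not to drop this
    # request even if the same URL has already been seen
    yield scrapy.Request(url, dont_filter=True, callback=self.parse)
    db.delete_one({'_id': link['_id']})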