I am trying to scrape some data with Scrapy (2.5.0) on Python (3.6.0).
Scrapy works for around 70 to 100 URLs, then just quits with "Spider closed (finished)" without any error, but there are more than 200K requests left to make.
import scrapy
from scrapy.linkextractors import LinkExtractor
# import pandas as pd
import pymongo
client = pymongo.MongoClient("mongodb+srv://<user>:<Password>@booksmotionscraper.9c8us.mongodb.net/booksmotion?retryWrites=true&w=majority")
db = client.libgen.libgen2
start = True

class lSpider(scrapy.Spider):
    name = "libgen_dlink"
    start_urls = [
        "https://booksmotion.com/main/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    ]

    def parse(self, response):
        global start
        # fetch the next record from MongoDB and queue its page
        link = db.find_one({})
        url = 'https://booksmotion.com/main/' + link['md5']
        yield scrapy.Request(url, callback=self.parse)
        # link= list(link)
        # print(link)
        db.delete_one({'_id': link['_id']})
        body = response.css('body')
        try:
            info = {
                'md5': response.url.rsplit('/', 1)[-1],
                'dlink': body.css('#download > ul > li:nth-child(2) > a').attrib['href']
            }
        except KeyError:
            # no download link found on the page
            info = {
                'md5': response.url.rsplit('/', 1)[-1],
                'dlink': 0
            }
        yield {
            'md5': info['md5'],
            'dlink': info['dlink']
        }
CodePudding user response:
Due to duplicates in the database, Scrapy silently skips those URLs: by default it filters out requests for URLs it has already seen, and because new URLs are only scheduled inside parse, the callback is never called for a filtered request. Eventually no new requests are queued and Scrapy closes the spider.
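If you want to confirm that the duplicate filter is what is dropping your requests before changing the spider, Scrapy's DUPEFILTER_DEBUG setting makes it log every filtered request instead of only the first one. A minimal sketch, reusing the spider name from the question:

    class lSpider(scrapy.Spider):
        name = "libgen_dlink"
        # log every request dropped by the duplicate filter,
        # so the skipped URLs show up in the crawl log
        custom_settings = {
            'DUPEFILTER_DEBUG': True,
        }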
Adding dont_filter=True fixes the problem:
yield scrapy.Request(url, dont_filter=True, callback=self.parse)
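Putting it together, here is a minimal sketch of the parse callback with the filter disabled for the self-queued requests; it assumes the same db collection and URL pattern as in the question:

    def parse(self, response):
        # pull the next md5 from MongoDB and queue its page;
        # dont_filter=True keeps the duplicate filter from silently dropping it
        link = db.find_one({})
        if link:
            url = 'https://booksmotion.com/main/' + link['md5']
            yield scrapy.Request(url, dont_filter=True, callback=self.parse)
            db.delete_one({'_id': link['_id']})
        # extract the download link from the current page
        body = response.css('body')
        try:
            dlink = body.css('#download > ul > li:nth-child(2) > a').attrib['href']
        except KeyError:
            dlink = 0
        yield {
            'md5': response.url.rsplit('/', 1)[-1],
            'dlink': dlink,
        }

Keep in mind that dont_filter=True shifts the responsibility for avoiding infinite loops to your own code; here the delete_one call is what stops the same document from being fetched over and over.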