What happens with already followed links in scrapy?


I have a spider that, let's say, follows all links on a website using the response.follow method, and it does this recursively. It can find the same link many times, but I understand that, by default, already-followed links are not followed again in recent versions of Scrapy. Is this true? I can't find much information about it. If it is true, would the spider just stop crawling once all possible links are exhausted and every yielded request is therefore a repeat?

CodePudding user response:

Scrapy has built-in duplicate filtering, which is turned on by default. That is, if Scrapy has already crawled a URL and parsed the response, it will not process another request you yield for that same URL. You can disable this per request by setting dont_filter=True.
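For illustration, here is a minimal sketch of the kind of recursive spider described in the question (the spider name and start URL are placeholders):

```python
import scrapy


class LinkSpider(scrapy.Spider):
    # Hypothetical names for illustration only.
    name = "link_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # The scheduler dedupes these by request fingerprint
            # (essentially the canonicalized URL for GET requests),
            # so a link seen many times is only crawled once.
            yield response.follow(href, callback=self.parse)

            # To force a re-crawl of an already-seen URL, you would
            # instead pass dont_filter=True:
            # yield response.follow(href, callback=self.parse,
            #                       dont_filter=True)
```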

From the documentation:

dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.

So, yes: it will simply stop crawling once all possible links are exhausted, because every duplicate link is filtered out.
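If you ever want the opposite behavior for a whole crawl rather than per request, Scrapy also lets you swap out the duplicate filter via the DUPEFILTER_CLASS setting. A sketch of what that would look like in settings.py:

```python
# settings.py -- disables duplicate filtering for the entire crawl.
# BaseDupeFilter ships with Scrapy and filters nothing, so every
# yielded request is scheduled, duplicates included.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"
```

With the default filter left in place, though, the crawl terminates naturally once every discoverable link has been requested exactly once.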
