Increase item count when web scraping


I am a beginner with the Scrapy framework and I have two questions/problems:

  1. I made a "scrapy.Spider" for a website, but it stops after 960 retrieved elements. How can I raise this limit? I need to retrieve roughly 1,600 elements (see the settings sketch after this list).
  2. Is it possible to run Scrapy indefinitely by adding a waiting time between runs of each "scrapy.Spider" (see the loop sketch after this list)?
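
For question 1, the only built-in stop conditions I know of are the CloseSpider settings; below is a sketch that pins them explicitly on the spider (0 means unlimited and is already the default, so if the cap comes from somewhere else this will not help):

import scrapy

class Pathfinder2Spider(scrapy.Spider):
    name = "Pathfinder2"
    custom_settings = {
        "CLOSESPIDER_ITEMCOUNT": 0,  # 0 = no item limit (the default)
        "CLOSESPIDER_PAGECOUNT": 0,  # 0 = no page limit (the default)
        "DEPTH_LIMIT": 0,            # 0 = no depth limit (the default)
    }

For question 2, what I mean by "launch scrapy infinitely" is roughly the loop below, adapted from the CrawlerRunner example in the Scrapy documentation (the 60-second pause is just a placeholder):

from twisted.internet import reactor, defer, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_forever():
    while True:
        # Run one full crawl and wait for it to finish
        yield runner.crawl(Pathfinder2Spider)
        # Pause before the next run (placeholder: 60 seconds)
        yield task.deferLater(reactor, 60, lambda: None)

crawl_forever()
reactor.run()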

UPDATED

import scrapy

# RE_LEVEL, RE_COMPONENTS and RE_RESISTANCE are compiled regular
# expressions defined elsewhere in the project (omitted here).

class Spell(scrapy.Item):
    name = scrapy.Field()
    level = scrapy.Field()
    components = scrapy.Field()
    resistance = scrapy.Field()

class Pathfinder2Spider(scrapy.Spider):
    name = "Pathfinder2"
    allowed_domains = ["d20pfsrd.com"]
    start_urls = ["https://www.d20pfsrd.com/magic/spell-lists-and-domains/spell-lists-sorcerer-and-wizard/"]

    def parse(self, response):
        # Collect all links to the wizard spell pages
        spells_links = response.xpath('//div/table/tbody/tr/td/a[has-class("spell")]')
        print("len(spells_links) : ", len(spells_links))
        for spell_link in spells_links:
            url = spell_link.xpath('@href').get().strip()
            # Follow the link and parse the full spell page
            yield response.follow(url, self.parse_spell)
        
    def parse_spell(self, response):
        # Get the article that holds the spell description
        article = response.xpath('//article[has-class("magic")]')
        # Relative path (.//) so the search stays inside this article
        contents = article.xpath('.//div[has-class("article-content")]')
        # Extract useful information
        all_names = article.xpath("h1/text()").getall()
        all_contents = contents.get()
        all_levels = RE_LEVEL.findall(all_contents)
        all_components = RE_COMPONENTS.findall(all_contents)
        all_resistances = RE_RESISTANCE.findall(all_contents)

        for name, level, components, resistance in zip(all_names, all_levels, all_components, all_resistances):

            # Treatment here ... (assumed to turn the raw matches into
            # spell_name, spell_level, spell_components and spell_resistance)

            yield Spell(
                name=spell_name,
                level=spell_level,
                components=spell_components,
                resistance=spell_resistance,
            )
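
I also wonder whether the zip() in parse_spell could be dropping items: zip() stops at its shortest input, so if one of the regexes finds no match on a page, that spell is silently skipped. A minimal illustration:

names = ["Fireball", "Shield", "Haste"]
levels = ["3", "1"]  # imagine one regex match missing
print(list(zip(names, levels)))
# [('Fireball', '3'), ('Shield', '1')]  -> 'Haste' is lost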

There are a total of ~1,600 links:

len(spells_links) : 1565

BUT only 953 items were scraped:

 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/404': 2,
 'item_scraped_count': 953,

I run the spider with this command: scrapy crawl Pathfinder2 -O XXX.json


So the number of links found is larger than the number of items scraped.
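
One thing I still have to check is whether duplicate links explain the gap: Scrapy's duplicate filter silently skips URLs that were already scheduled, and the same spell can be linked from several table rows. A sketch of how I would count unique URLs in parse (the urls list is my addition):

    def parse(self, response):
        spells_links = response.xpath('//div/table/tbody/tr/td/a[has-class("spell")]')
        urls = [link.xpath("@href").get().strip() for link in spells_links]
        print("total links:", len(urls), "unique links:", len(set(urls)))
        for url in set(urls):
            yield response.follow(url, self.parse_spell)

The dupefilter/filtered stat in the crawl log should also show how many requests were skipped as duplicates.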
