set limit to pages for scrapy

Time:05-16

I'm scraping https://myanimelist.net/anime.php#/ and as you can see there is a Genres section. I want to return, as a CSV, only the first 18 genre links and stop before the Explicit Genres. How can I do that? Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.exceptions import CloseSpider

class Link(scrapy.Item):
    link = scrapy.Field()

class LinkListsSpider(scrapy.Spider):    
    name = 'link_lists'
    allowed_domains = ['myanimelist.net']
    start_urls = ['https://myanimelist.net/anime.php#/']

    def parse(self, response):

        xpath = '//a[re:test(@class, "genre-name-link")]/@href'
        selection = response.xpath(xpath)
        for s in selection:
            l = Link()
            l['link'] = 'https://en.wikipedia.org' + s.get()
            yield l

CodePudding user response:

Don't think of it as "setting a limit to pages". You may see "pages" in the list of links, but scrapy doesn't see pages. It sees a giant piece of HTML. Also don't think of scraping as scanning the page the way your eye does. Your job is to use selectors like a knife to carve out the section you want to look at. You use XPath to navigate to and draw the boundaries around that section.

The method I used is to identify the section called Genres, then collect all the links under that section only. Since that section is actually the next sibling of the title div (rather than a descendant, as you might think by looking at it), I used the following-sibling axis, then `[1]` to "go to the first following div (which contains the 18 genres) and collect all links from under that."

In other words, the HTML looks like this:

<div>Genres</div>
<div class="...">
    -- Anime Genre Links here --
</div>
<div>Explicit Genres</div>
<div class="...">
    -- Explicit Genre Links here --
</div>

So the way you navigate this is to locate <div>Genres</div>, then hop to its following sibling (the next div), then look for links inside that.

class LinkListsSpider(scrapy.Spider):
    name = 'link_lists'
    allowed_domains = ['myanimelist.net']
    start_urls = ['https://myanimelist.net/anime.php#/']

    def parse(self, response, **kwargs):
        xpath = '//div[text()="Genres"]/following-sibling::div[1]//a/@href'
        selection = response.xpath(xpath)
        for s in selection:
            l = Link()
            l['link'] = 'https://en.wikipedia.org' + s.get()
            yield l

By the way, make sure you add **kwargs to your parse method's arguments so it matches the base class signature more closely.
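Since you want the result as a CSV, you don't need any extra code for that either: Scrapy's built-in feed exports can write the yielded items straight to a file. Assuming the spider lives inside a Scrapy project:

```shell
# -O overwrites the output file each run; -o appends instead
# (the -O flag requires Scrapy 2.1 or later)
scrapy crawl link_lists -O links.csv
```

The file format is inferred from the extension, so `.json` or `.jl` work the same way.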

CodePudding user response:

Please use @Steven's answer. I just want to illustrate how to get the first 18 links from the page using XPath:

'(//a[@class="genre-name-link"])[position() <= 18]/@href'