Scrapy CrawlSpider: following a next page that uses an input tag

Time:11-11

I'm using Scrapy with a CrawlSpider and want to collect every post on this website. Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ToscrapeSpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh']

    rule_articles_page = Rule(LinkExtractor(restrict_xpaths="//div[@id='body']/div[2]/a"), callback='parse_item', follow=False)
    rule_next_page = Rule(LinkExtractor(restrict_xpaths="//form[@class='nav next']/input[1]/@value", tags=('input'), attrs=('value',), process_value='process_value'),
                              follow=True,)

    rules = (
        rule_articles_page,
        rule_next_page,
    )
    def parse_item(self, response):
        yield {
            'Image': response.xpath("//div[@id='body']/img/@src").extract(),
            'Title': response.xpath("//div[@id='content']/h1/text()").extract(),
            'artist': response.xpath("//div[@id='content']/h2/text()").extract(),
            'Description': response.xpath("//div[@class='description']/p/text()").extract(),
            'URL': response.url,
            'Dimention' : response.xpath("//tbody/tr/td[text()='Dimensions']/text()").extract(),

        }

The problem is that the spider never goes to the next page, because the next-page button is a form, not an anchor tag.

Also, please help me get the image dimensions (in cm, if available) from the article page.

CodePudding user response:

This is a basic loop I put together. There may be a better approach, but it works for your problem.

import scrapy

class PagedataSpider(scrapy.Spider):

    name = 'pagedata'
    page=1
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1']

    def parse(self, response):

        yield {
            'Title': response.css("div h1::text").getall()
        }

        # next_page=response.css('input[name="page"]::attr(value)').get()
        
        
        # Advance the class-level page counter and request the next listing page.
        if PagedataSpider.page <=114:
            PagedataSpider.page += 1
            nextPage=f'http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page={PagedataSpider.page}'

            yield scrapy.Request(nextPage,callback=self.parse)
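The commented-out selector above hints at an alternative to counting up to a fixed page number: read the page value from the "next" form itself and stop when the form is missing. Below is a minimal standard-library sketch of that idea. The form markup in `sample`, the class `NextFormParser`, and the helper `next_page_url` are all hypothetical, modelled only on the XPath from the question, so check them against the real page source before relying on them.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class NextFormParser(HTMLParser):
    """Collect the name/value pairs of inputs inside <form class="nav next">."""

    def __init__(self):
        super().__init__()
        self.in_next_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("class") == "nav next":
            self.in_next_form = True
        elif tag == "input" and self.in_next_form and "name" in attrs:
            # Inputs without a name (e.g. the submit button) are not submitted.
            self.fields[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_next_form = False


def next_page_url(page_url, html):
    """Return the URL the 'next' form would request, or None if there is no such form."""
    parser = NextFormParser()
    parser.feed(html)
    if not parser.fields:
        return None
    # A query-only reference keeps the current path and replaces the query string.
    return urljoin(page_url, "?" + urlencode(parser.fields))


# Hypothetical markup: an assumption, not copied from the site.
sample = (
    '<form class="nav next">'
    '<input type="hidden" name="page" value="2">'
    '<input type="submit" value="Next">'
    '</form>'
)
print(next_page_url("http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1", sample))
```

Inside a spider, the same URL could be passed to scrapy.Request, and a None result would end the pagination without a hard-coded last page.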

CodePudding user response:

  1. To get the dimensions, remove 'tbody' from the XPath and select the next sibling after the 'key' td (to get the text of the 'value' td).

  2. Set 'process_value' so that the extracted input value is turned into a '?page=N' link.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ToscrapeSpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh']

    rule_articles_page = Rule(LinkExtractor(restrict_xpaths="//div[@id='body']/div[2]/a"), callback='parse_item', follow=False)
    rule_next_page = Rule(LinkExtractor(restrict_xpaths="//form[@class='nav next']/input[1]", tags=('input'), attrs=('value',),
                                        process_value=(lambda x: f'?page={x[-1]}')),
                          follow=True,)

    rules = (
        rule_articles_page,
        rule_next_page,
    )

    def parse_item(self, response):
        yield {
            'Image': response.xpath("//div[@id='body']/img/@src").get(),
            'Title': response.xpath("//div[@id='content']/h1/text()").get(),
            'artist': response.xpath("//div[@id='content']/h2/text()").get(),
            'Description': response.xpath("//div[@class='description']/p/text()").get(),
            'URL': response.url,
            'Dimension': response.xpath('//tr/td[text()="Dimensions"]/following-sibling::td[1]/text()').get(),
        }
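One detail of the process_value lambda is worth checking in isolation: `x[-1]` keeps only the last character of the extracted value. The sketch below compares it with a hypothetical callback, `page_query`, that keeps the whole value. It assumes the input's value attribute holds the bare page number, which the page source would need to confirm.

```python
# The callback from the answer above: keeps only the last character of the value.
last_char = lambda x: f'?page={x[-1]}'


def page_query(value):
    """Hypothetical process_value callback: build a relative '?page=N' URL.

    Assumes the <input> value is the bare page number. Returning None tells
    LinkExtractor to ignore the value altogether.
    """
    value = value.strip()
    return f'?page={value}' if value.isdigit() else None


print(last_char('2'), page_query('2'))
print(last_char('12'), page_query('12'))
```

For single-digit values the two agree. For a value such as '12' the lambda yields '?page=2', so if the site has more than nine pages and the value is the full number, passing `process_value=page_query` may be the safer choice.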

Output:

[scrapy.core.scraper] DEBUG: Scraped from <200 http://pstrial-2019-12-16.toscrape.com/item/12125/Front_Panel_for_Blouse?back=155>
{'Image': '/content/12125.jpg', 'Title': 'Front Panel for Blouse', 'artist': None, 'Description': 'reverse patchwork of animal designs in blue, yellow, red and orange', 'URL': 'http://pstrial-2019-12-16.toscrape.com/item/12125/Front_Panel_for_Blouse?back=155', 'Dimension': '16 3/4 x 22 1/8in. (42.5 x 56.2cm)'}
...
...
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1)
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1)
[scrapy.core.scraper] DEBUG: Scraped from <200 http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155>
{'Image': '/content/12712.jpg', 'Title': 'The Muses of Music and Poetry', 'artist': 'Sculptor: Guillaume Coustou the Younger', 'Description': None, 'URL': 'http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155', 'Dimension': '24 5/8 x 15 1/2 x 10 1/2 in. (62.55 x 39.37 x 26.67 cm)'}
...
...
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=3> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2)
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/item/13484/Don_Pascual_y_su_esposa?back=155> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2)
...
...
...
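The question asked for the dimensions in cm when available. The 'Dimension' strings in the output above put the metric values in parentheses, so they can be pulled out after scraping. This is a rough sketch based only on the two sample strings shown; the helper `dimensions_cm` is hypothetical, and other items may use a different format.

```python
import re

# A parenthesised group ending in "cm", e.g. "(42.5 x 56.2cm)".
CM_GROUP = re.compile(r'\(([^()]*?)\s*cm\)')
NUMBER = re.compile(r'\d+(?:\.\d+)?')


def dimensions_cm(text):
    """Return the centimetre values in a dimension string as floats, or [] if none."""
    if not text:
        return []
    match = CM_GROUP.search(text)
    if match is None:
        return []
    return [float(n) for n in NUMBER.findall(match.group(1))]


print(dimensions_cm('16 3/4 x 22 1/8in. (42.5 x 56.2cm)'))
print(dimensions_cm('24 5/8 x 15 1/2 x 10 1/2 in. (62.55 x 39.37 x 26.67 cm)'))
```

In parse_item, the result could be stored next to the raw string, for example under a separate 'Dimension_cm' key, so that items without metric values simply get an empty list.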