I'm using Scrapy with a CrawlSpider. I want to scrape all posts on this website. Here is the code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ToscrapeSpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh']

    rule_articles_page = Rule(LinkExtractor(restrict_xpaths="//div[@id='body']/div[2]/a"), callback='parse_item', follow=False)
    rule_next_page = Rule(LinkExtractor(restrict_xpaths="//form[@class='nav next']/input[1]/@value", tags=('input'), attrs=('value',), process_value='process_value'),
                          follow=True,)

    rules = (
        rule_articles_page,
        rule_next_page,
    )

    def parse_item(self, response):
        yield {
            'Image': response.xpath("//div[@id='body']/img/@src").extract(),
            'Title': response.xpath("//div[@id='content']/h1/text()").extract(),
            'artist': response.xpath("//div[@id='content']/h2/text()").extract(),
            'Description': response.xpath("//div[@class='description']/p/text()").extract(),
            'URL': response.url,
            'Dimention' : response.xpath("//tbody/tr/td[text()='Dimensions']/text()").extract(),
        }
The problem is that the spider does not go to the next page, because the next-page button is a form, not an anchor tag.
Also, please help me get the image dimensions (in cm, if available) on the article page.
CodePudding user response:
This is a basic loop I wrote. You may find a better approach, but it works for your problem.
import scrapy


class PagedataSpider(scrapy.Spider):
    name = 'pagedata'
    page = 1  # class-level counter of the listing page being crawled
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1']

    def parse(self, response):
        yield {
            'Title': response.css("div h1::text").getall()
        }
        # next_page=response.css('input[name="page"]::attr(value)').get()
        # 114 is the hard-coded page limit for this listing.
        if PagedataSpider.page <= 114:
            PagedataSpider.page += 1  # advance the counter before building the URL
            nextPage = f'http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page={PagedataSpider.page}'
            yield scrapy.Request(nextPage, callback=self.parse)
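The commented-out selector in the loop above hints at an alternative to the hard-coded page limit: read the next page number from the "next" form and stop when it is missing. The following is only a minimal sketch of that idea. `next_page_url` is a hypothetical helper (not part of Scrapy), and it assumes that the first input of the `nav next` form holds the next page number, as the XPath in the question suggests.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, next_value: Optional[str]) -> Optional[str]:
    """Return current_url with its 'page' query parameter set to next_value.

    Returns None when next_value is missing or not a number, which is
    treated here as "there is no next page" (an assumption about the site).
    """
    if next_value is None or not next_value.strip().isdigit():
        return None
    parts = urlsplit(current_url)
    # Keep every query parameter except an existing 'page', then append the new one.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "page"]
    query.append(("page", next_value.strip()))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Possible use inside a spider's parse() method (the XPath is taken from the
# question and is an assumption about the page markup):
#     value = response.xpath("//form[@class='nav next']/input[1]/@value").get()
#     url = next_page_url(response.url, value)
#     if url:
#         yield scrapy.Request(url, callback=self.parse)
```

With this approach the spider no longer needs the class-level counter, and it stops by itself once the form is absent from the last page.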
CodePudding user response:
To get the dimensions, remove 'tbody' from the XPath and select the next sibling after the 'key' td (to get the text of the 'value' td).
Set 'process_value' to build the next-page URL from the extracted page number.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ToscrapeSpider(CrawlSpider):
    name = 'toscrape'
    allowed_domains = ['pstrial-2019-12-16.toscrape.com']
    start_urls = ['http://pstrial-2019-12-16.toscrape.com/browse/insunsh']

    # Follow the links to the item pages and parse each one.
    rule_articles_page = Rule(LinkExtractor(restrict_xpaths="//div[@id='body']/div[2]/a"), callback='parse_item', follow=False)
    # Treat the 'value' attribute of the "next" form's first input as a link.
    # Note: x[-1] is only the last character of the extracted value,
    # so this lambda handles single-digit page numbers only.
    rule_next_page = Rule(LinkExtractor(restrict_xpaths="//form[@class='nav next']/input[1]", tags=('input',), attrs=('value',),
                                        process_value=(lambda x: f'?page={x[-1]}')),
                          follow=True,)

    rules = (
        rule_articles_page,
        rule_next_page,
    )

    def parse_item(self, response):
        yield {
            'Image': response.xpath("//div[@id='body']/img/@src").get(),
            'Title': response.xpath("//div[@id='content']/h1/text()").get(),
            'artist': response.xpath("//div[@id='content']/h2/text()").get(),
            'Description': response.xpath("//div[@class='description']/p/text()").get(),
            'URL': response.url,
            # The 'value' cell is the first td following the 'Dimensions' key cell.
            'Dimension': response.xpath('//tr/td[text()="Dimensions"]/following-sibling::td[1]/text()').get(),
        }
Output:
[scrapy.core.scraper] DEBUG: Scraped from <200 http://pstrial-2019-12-16.toscrape.com/item/12125/Front_Panel_for_Blouse?back=155>
{'Image': '/content/12125.jpg', 'Title': 'Front Panel for Blouse', 'artist': None, 'Description': 'reverse patchwork of animal designs in blue, yellow, red and orange', 'URL': 'http://pstrial-2019-12-16.toscrape.com/item/12125/Front_Panel_for_Blouse?back=155', 'Dimension': '16 3/4 x 22 1/8in. (42.5 x 56.2cm)'}
...
...
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1)
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=1)
[scrapy.core.scraper] DEBUG: Scraped from <200 http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155>
{'Image': '/content/12712.jpg', 'Title': 'The Muses of Music and Poetry', 'artist': 'Sculptor: Guillaume Coustou the Younger', 'Description': None, 'URL': 'http://pstrial-2019-12-16.toscrape.com/item/12712/The_Muses_of_Music_and_Poetry?back=155', 'Dimension': '24 5/8 x 15 1/2 x 10 1/2 in. (62.55 x 39.37 x 26.67 cm)'}
...
...
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=3> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2)
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://pstrial-2019-12-16.toscrape.com/item/13484/Don_Pascual_y_su_esposa?back=155> (referer: http://pstrial-2019-12-16.toscrape.com/browse/insunsh?page=2)
...
...
...
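The question asks for the dimensions in cm when available, while the 'Dimension' field in the output above still holds the whole string (inches plus a parenthesised cm part). One possible way to pull out just the cm values is a small regular expression. This is a sketch only: `dimensions_in_cm` is a hypothetical helper, and it assumes the cm part always looks like the two samples in the output, that is, numbers inside parentheses ending in "cm".

```python
import re
from typing import Optional, Tuple

# A parenthesised group that ends in "cm", e.g. "(42.5 x 56.2cm)".
_CM_GROUP = re.compile(r"\(([^()]*?)\s*cm\)", re.IGNORECASE)
# An integer or decimal number such as "42" or "42.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def dimensions_in_cm(text: Optional[str]) -> Optional[Tuple[float, ...]]:
    """Return the cm values found in a dimension string, or None if there are none."""
    if not text:
        return None
    match = _CM_GROUP.search(text)
    if match is None:
        return None
    values = tuple(float(number) for number in _NUMBER.findall(match.group(1)))
    return values or None


print(dimensions_in_cm('16 3/4 x 22 1/8in. (42.5 x 56.2cm)'))
# prints (42.5, 56.2)
```

In `parse_item` the helper could wrap the existing XPath result, for example `dimensions_in_cm(response.xpath(...).get())`, so that items without a cm part simply get `None`.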