Scrapy: scrape URLs in sequence and repeated output


At the moment this crawler sort of works and gives me a response, but I have a few issues. The first is the sequence of the pages scraped: I'd like it to start from page 1 up to the range that I set, but right now it seems to pick pages randomly and also repeats them. The second is the output: it is full of duplicates, null values, and out-of-order rows. I don't know if the problem is in the rules or in the crawler.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

            
class QuotesSpider(CrawlSpider):
    name = "catspider"
    start_urls = []
    for i in range(1,10):
        if i % 2 == 1:
            start_urls.append('https://www.worldcat.org/title/rose-in-bloom/oclc/' + str(i) + '&referer=brief_results')
            

    rules = (
        Rule(LinkExtractor(allow='title')),
        Rule(LinkExtractor(allow='oclc'), callback='parse_item')
    )


    def parse_item(self, response):
        yield {
            'title': response.css('h1.title::text').get(),
            'author': response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher': response.css('td[id="bib-publisher-cell"]::text').get(),
            'format': response.css('span[id="editionFormatType"] span::text').get(),
            'isbn': response.css('tr[id="details-standardno"] td::text').get(),
            'oclc':  response.css('tr[id="details-oclcno"] td::text').get()
        }

Extra info: from someone who has more experience with Scrapy, which is better and why, XPath or CSS selectors?

Thanks for any info.

CodePudding user response:

You can build the pagination directly in start_urls with a for loop over a range, which in my experience is roughly twice as fast as other pagination approaches. And when each item contains a link, one of the best options is to use xpath in the rules.
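As a minimal sketch of that idea (assuming WorldCat pages its results in steps of 10, which is what the offsets below imply), the list is built once, before the spider starts:

# Build every results-page URL up front; Scrapy schedules them all as start requests.
# str(i) + '1' yields start=11, 21, ... 101, i.e. one URL per page of results.
start_urls = [
    'https://www.worldcat.org/search?q=oclc&fq=&dblist=638&start='
    + str(i) + '1&qt=page_number_link'
    for i in range(1, 11)
]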

Extra info: from someone who has more experience with Scrapy, which is better and why, XPath or CSS selectors?

Regarding your extra question: xpath and css locators are both fine, but xpath is a little richer, because it can easily move up and down the HTML tree, and you can also mix xpath and css on the same selector. Here is a working example.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor 
from scrapy.crawler import CrawlerProcess   
        
class QuotesSpider(CrawlSpider):
    name = "catspider"
    start_urls = ['https://www.worldcat.org/search?q=oclc&fq=&dblist=638&start=' + str(i) + '1&qt=page_number_link' for i in range(1,11)]

    # follow the title link inside each search result (the "name" class is assumed from WorldCat's result markup)
    rules = (Rule(LinkExtractor(restrict_xpaths='//*[@class="name"]/a'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {
            'title' : response.css('h1.title::text').get(),
            'author' : response.css('td[id="bib-author-cell"] a::text').getall(),
            'publisher' : response.css('td[id="bib-publisher-cell"]::text').get(),
            'format' : response.css('span[id="editionFormatType"] span::text').get(),
            'isbn' : response.css('tr[id="details-standardno"] td::text').get(),
            'oclc' :  response.css('tr[id="details-oclcno"] td::text').get()
            }

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
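To make the point about mixing the two selector styles concrete, here is a small sketch, assuming response is one of the record pages that parse_item receives (the cell id is the same one used above):

# Selectors can be chained: xpath walks to the author cell,
# css then pulls the link text out of it.
authors = response.xpath('//td[@id="bib-author-cell"]').css('a::text').getall()

# Moving up the tree is easy with xpath, which plain css cannot do:
row_id = response.css('td#bib-author-cell').xpath('./parent::tr/@id').get()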