Crawler returning crawled results with \n's, how to get rid of these

Time:12-01

This crawler is meant to return all the text on a page along with its links, and we store the scraped data in JSON files. The problem is that the JSON output is cluttered with redundant characters such as \n.

Here is the Scrapy spider:

import itemloaders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem



class crawler(CrawlSpider):
    name = 'spidey'
    start_urls = ['https://quotes.toscrape.com/page/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):

        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = '\n'.join(response.xpath(
            '//text()').extract())
        # or better just save whole source
        #item['source'] = response.body

        yield item

Example output in a json file:

{"url": "https://quotes.toscrape.com/tag/miracles/page/1/", "title": "miracles", "body": "\n\n\n\t\n\n\t\nQuotes to Scrape\n\n    \n\n    \n\n\n\n\n\n    \n\n        \n\n            \n\n                \n\n                    \nQuotes to Scrape\n\n                \n\n            \n\n            \n\n                \n\n                \n                    \nLogin\n\n                \n                \n\n            \n\n        \n\n    \n\n\nViewing tag: \nmiracles\n\n\n\n\n    \n\n\n    \n\n        \n\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n\n        \nby \nAlbert Einstein\n\n        \n(about)\n\n        \n\n        \n\n            Tags:\n            \n \n            \n            \ninspirational\n\n            \n            \nlife\n\n            \n            \nlive\n\n            \n            \nmiracle\n\n            \n            \nmiracles\n\n            \n        \n\n    \n\n\n    \n\n        \n\n            \n            \n        \n\n    \n\n    \n\n    \n\n        \n            \nTop Ten tags\n\n            \n            \n\n            \nlove\n\n            \n\n            \n            \n\n            \ninspirational\n\n            \n\n            \n            \n\n            \nlife\n\n            \n\n            \n            \n\n            \nhumor\n\n            \n\n            \n            \n\n            \nbooks\n\n            \n\n            \n            \n\n            \nreading\n\n            \n\n            \n            \n\n            \nfriendship\n\n            \n\n            \n            \n\n            \nfriends\n\n            \n\n            \n            \n\n            \ntruth\n\n            \n\n            \n            \n\n            \nsimile\n\n            \n\n            \n        \n    \n\n\n\n\n    \n\n    \n\n        \n\n            \n\n                Quotes by: \nGoodReads.com\n\n            \n\n            \n\n            
    Made with \n\u2764\n by \nScrapinghub\n\n            \n\n        \n\n    \n\n\n\n"},

How can I fix this?

CodePudding user response:

One possible answer to your question, exactly as written, is to use replace:

>>> "A lot of newline\n\n\n    characters\n\n\n\n\n\n\n\n\n\n\n\n\n".replace("\n", "")
'A lot of newline    characters'

Cleaning scraped content is usually a bit more involved, though. You typically don't want to remove every newline unconditionally, and you often have to deal with excessive whitespace as well (as in your example). In those situations a regular expression is a better fit. A very simple example:

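>>> import re
>>> s = "A lot of newline\n\n\n  \t\t  characters\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> # (\s)+ matches a run of whitespace; \1 keeps only the last character of that run
>>> re.sub(r"(\s)+", r"\1", s)
'A lot of newline characters\n'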

The expression above is trivial, but regular expressions can get far more elaborate. A single pattern can encode rules that would otherwise take many lines of code when cleaning, searching, or validating text.
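As a slightly fuller illustration of the point about not removing every newline, here is a minimal sketch that collapses runs of spaces and tabs and squeezes repeated blank lines while keeping single line breaks. The helper name `clean_text` and the sample string are made up for this example; adjust the rules to whatever your data needs.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical helper: tidy whitespace but keep line structure."""
    # Collapse runs of spaces/tabs inside each line and trim the line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Squeeze two or more consecutive newlines into one blank line.
    text = re.sub(r"\n{2,}", "\n\n", text)
    # Drop leading and trailing whitespace of the whole string.
    return text.strip()


# Sample input loosely modelled on the scraped body in the question.
sample = "Viewing tag: \nmiracles\n\n\n\n    \n\n  by \nAlbert Einstein\n"
print(repr(clean_text(sample)))
```

In the spider this could be applied to the joined body, for example `item['body'] = clean_text(item['body'])`.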

CodePudding user response:

Strip each text node before joining, and join with a space instead of '\n'. That produces the desired output:

Code:

import itemloaders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem


class Crawler(CrawlSpider):
    name = 'spidey'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/page/1/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):

        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = [' '.join(x.get().strip() for x in response.xpath('//text()'))]
        # or better just save whole source
        #item['source'] = response.body

        yield item

Output:

{'url': 'https://quotes.toscrape.com/', 'title': 'Quotes to Scrape', 'body': ["   Quotes to Scrape          Quotes to Scrape      Login        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”  by Albert Einstein  (about)   Tags:  change  deep-thoughts  thinking  world 
    “It is our choices, Harry, that show what we truly are, far more than our abilities.”  by J.K. Rowling  (about)   Tags:  abilities  choices     “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”  by Albert Einstein  (about)   Tags:  inspirational  life  live  miracle  miracles     “The person, be it gentleman or lady, who has not pleasure in a good novel, must 
be intolerably stupid.”  by Jane Austen  (about)   Tags:  aliteracy  books  classic  humor     “Imperfection is 
beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”  by Marilyn Monroe  (about)   Tags:  be-yourself  inspirational     “Try not to become a man of success. Rather become a man of value.”  by Albert Einstein  (about)   Tags:  adulthood  success  value     “It is better to be hated for what you are than to be loved for what you are not.”  by André Gide  (about)   Tags:  life  love     “I have not failed. I've just found 10,000 ways that won't work.”  by Thomas A. Edison  (about)   Tags:  edison  failure  inspirational  paraphrased     “A woman is like a tea bag; you never know how strong it is until it's in hot water.”  by Eleanor Roosevelt  (about)   Tags:  misattributed-eleanor-roosevelt     “A day without sunshine is like, you know, night.”  by Steve Martin  (about)   Tags:  humor  obvious  simile       Next →       Top Ten tags   love    
inspirational    life    humor    books    reading    friendship    friends    truth    simile        Quotes by: GoodReads.com   Made with ❤ by Scrapinghub     "]}

... so on
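The output above still contains runs of spaces, because text nodes that consist only of whitespace become empty strings after `strip()` and are still joined in. A possible refinement, sketched here on a plain list of strings so it runs without Scrapy, is to drop the empty nodes before joining. The helper name `join_text_nodes` and the sample list are assumptions made for illustration.

```python
def join_text_nodes(nodes):
    """Hypothetical helper: strip each node, skip empty ones, join with one space."""
    stripped = (node.strip() for node in nodes)
    return " ".join(text for text in stripped if text)


# Stand-in for the list of text nodes a page might yield.
nodes = ["\n    ", "Quotes to Scrape", "\n", "\n        ", "by ", "Albert Einstein", "\n"]
print(join_text_nodes(nodes))
```

Inside `parse_item` this might look like `item['body'] = join_text_nodes(response.xpath('//text()').getall())`. Filtering in the query itself with the XPath predicate `//text()[normalize-space()]` should also skip whitespace-only nodes, though the individual nodes would still need stripping.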
