This crawler is meant to return all the text on a page along with its links, and we store the scraped data in JSON files. The problem is that the JSON output is full of redundant characters such as \n.
Here is the Scrapy spider:
import itemloaders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem

class crawler(CrawlSpider):
    name = 'spidey'
    start_urls = ['https://quotes.toscrape.com/page/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = '\n'.join(response.xpath(
            '//text()').extract())
        # or better just save whole source
        #item['source'] = response.body
        yield item
Example output in a json file:
{"url": "https://quotes.toscrape.com/tag/miracles/page/1/", "title": "miracles", "body": "\n\n\n\t\n\n\t\nQuotes to Scrape\n\n \n\n \n\n\n\n\n\n \n\n \n\n \n\n \n\n \nQuotes to Scrape\n\n \n\n \n\n \n\n \n\n \n \nLogin\n\n \n \n\n \n\n \n\n \n\n\nViewing tag: \nmiracles\n\n\n\n\n \n\n\n \n\n \n\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n\n \nby \nAlbert Einstein\n\n \n(about)\n\n \n\n \n\n Tags:\n \n \n \n \ninspirational\n\n \n \nlife\n\n \n \nlive\n\n \n \nmiracle\n\n \n \nmiracles\n\n \n \n\n \n\n\n \n\n \n\n \n \n \n\n \n\n \n\n \n\n \n \nTop Ten tags\n\n \n \n\n \nlove\n\n \n\n \n \n\n \ninspirational\n\n \n\n \n \n\n \nlife\n\n \n\n \n \n\n \nhumor\n\n \n\n \n \n\n \nbooks\n\n \n\n \n \n\n \nreading\n\n \n\n \n \n\n \nfriendship\n\n \n\n \n \n\n \nfriends\n\n \n\n \n \n\n \ntruth\n\n \n\n \n \n\n \nsimile\n\n \n\n \n \n \n\n\n\n\n \n\n \n\n \n\n \n\n Quotes by: \nGoodReads.com\n\n \n\n \n\n Made with \n\u2764\n by \nScrapinghub\n\n \n\n \n\n \n\n\n\n"},
How can I fix this?
CodePudding user response:
One possible answer to your question, exactly as written, is to use str.replace:
>>> "A lot of newline\n\n\n characters\n\n\n\n\n\n\n\n\n\n\n\n\n".replace("\n", "")
'A lot of newline characters'
Cleaning scraped content is often a bit more involved, though. You typically don't want to remove every newline character unconditionally, and you may also have to deal with excessive whitespace (as in your example). For those situations you may want to use a regular expression instead. A very simple example is:
>>> import re
>>> s = "A lot of newline\n\n\n \t\t characters\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> re.sub(r"(\s)+", r"\1", s)
'A lot of newline characters\n'
The expression above is trivial, but a regular expression can encode many rules at once, so it can replace many lines of code when cleaning, searching, or validating text.
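As a slightly fuller sketch of that idea, the function below collapses runs of spaces and tabs into one space and collapses runs of blank lines into a single newline, so line structure survives. The name `clean_text` and the exact rules are assumptions made for illustration; they are not from the question, so adjust them to your data.

```python
import re


def clean_text(text):
    """Normalize whitespace in scraped text (illustrative rules only)."""
    # Collapse runs of spaces and tabs (but not newlines) into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse any run of newlines, with optional single spaces around
    # them, into exactly one newline.
    text = re.sub(r" ?(?:\n ?)+", "\n", text)
    # Drop leading and trailing whitespace.
    return text.strip()
```

For instance, `clean_text("Quotes to Scrape\n\n \n\n \nLogin")` returns `'Quotes to Scrape\nLogin'`.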
CodePudding user response:
The spider below produces the desired output.
Code:
import itemloaders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem

class Crawler(CrawlSpider):
    name = 'spidey'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/page/1/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = [' '.join(x.get().strip() for x in response.xpath('//text()'))]
        # or better just save whole source
        #item['source'] = response.body
        yield item
Output:
{'url': 'https://quotes.toscrape.com/', 'title': 'Quotes to Scrape', 'body': [" Quotes to Scrape Quotes to Scrape Login “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein (about) Tags: change deep-thoughts thinking world
“It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling (about) Tags: abilities choices “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein (about) Tags: inspirational life live miracle miracles “The person, be it gentleman or lady, who has not pleasure in a good novel, must
be intolerably stupid.” by Jane Austen (about) Tags: aliteracy books classic humor “Imperfection is
beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” by Marilyn Monroe (about) Tags: be-yourself inspirational “Try not to become a man of success. Rather become a man of value.” by Albert Einstein (about) Tags: adulthood success value “It is better to be hated for what you are than to be loved for what you are not.” by André Gide (about) Tags: life love “I have not failed. I've just found 10,000 ways that won't work.” by Thomas A. Edison (about) Tags: edison failure inspirational paraphrased “A woman is like a tea bag; you never know how strong it is until it's in hot water.” by Eleanor Roosevelt (about) Tags: misattributed-eleanor-roosevelt “A day without sunshine is like, you know, night.” by Steve Martin (about) Tags: humor obvious simile Next → Top Ten tags love
inspirational life humor books reading friendship friends truth simile Quotes by: GoodReads.com Made with ❤ by Scrapinghub "]}
... and so on
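One thing to watch with `' '.join(x.get().strip() ...)`: whitespace-only text nodes become empty strings, and each of them still contributes a separator, so the stored body can contain runs of spaces. Here is a minimal sketch of one way around that. It assumes a hypothetical helper named `join_text_nodes` (not part of Scrapy) that receives the list of strings returned by `response.xpath('//text()').getall()`.

```python
def join_text_nodes(fragments):
    """Join text fragments with single spaces, skipping empty ones."""
    words = []
    for fragment in fragments:
        # str.split() with no arguments splits on any whitespace run and
        # yields nothing for whitespace-only strings, so empty nodes vanish
        # and whitespace inside a fragment is normalized too.
        words.extend(fragment.split())
    return " ".join(words)
```

In the spider this would be used as something like `item['body'] = join_text_nodes(response.xpath('//text()').getall())`.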