Scrapy returning additional spaces in my .csv?

Time:12-20

I'm scraping to .csv and I'm getting many extra spaces in the .csv file that are not in the actual web page. I'm able to remove tabs and line breaks using .replace() but the spaces don't get removed using .replace(). Even if there was something unusual in the formatting of the web page, it should get removed by the .replace(). What am I missing?

import re
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DhqSpider(CrawlSpider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
            Rule(LinkExtractor(allow = 'index.html')), 
            Rule(LinkExtractor(allow = 'vol'), callback='parse_article'),        
        )

    def parse_article(self, response):
        yield { 
            'title' : response.css('h1.articleTitle::text').get().replace('\n', '').replace('\t', '').replace('\s ', ' '),
            'author1' : response.css('div.author a::text').getall(),
            'year' : response.css('div#pubInfo::text')[0].get(),
            'volume' : response.css('div#pubInfo::text')[1].get(),
            'xmllink' : response.urljoin(response.xpath('(//div[@]/a[contains(@href, ".xml")]/@href)[1]').get()),            
        }

Piece of the csv. https://pastebin.com/7GvZT3b9

Link to one of the pages that's included in the .csv. http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html

CodePudding user response:

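You can use XPath's normalize-space() function, which strips leading and trailing whitespace and collapses each internal run of whitespace into a single space.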

Either replace the CSS selectors with XPath selectors, or drop ::text from the CSS selector and chain .xpath('normalize-space(text())') onto it, as shown in the example.

Example:

import scrapy
import unidecode  # third-party package; transliterates to ASCII, turning "\xa0" into a plain space


class DhqSpider(scrapy.Spider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html']

    def parse(self, response):
        item = {
            'title': response.css('h1.articleTitle').xpath('normalize-space(text())').get(default='').strip(),
            'author1': response.css('div.author a').xpath('normalize-space(text())').getall(),
            'year': unidecode.unidecode(response.css('div#pubInfo::text')[0].get()),
            'volume': unidecode.unidecode(response.css('div#pubInfo::text')[1].get()),
            'xmllink': response.urljoin(response.xpath('(//div[@]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
        item['author1'] = [unidecode.unidecode(i) for i in item['author1']]
        yield item

Output:

{'title': 'Ethical and Effective Visualization of Knowledge Networks', 'author1': ['Chelsea Canon', 'canon_at_nevada_dot_unr_dot_edu', ' https://orcid.org/0000-0002-0431-343X', 'Douglas Boyle', 'douglasb_at_unr_dot_edu', ' https://orcid.org/0000-0002-3301-3997', 'K. J. Hepworth', 'katherine_dot_hepworth_at_unisa_dot_edu_dot_au', ' https://orcid.org/0000-0003-1059-567X'], 'year': '2022', 'volume': 'Volume 16 Number 3', 'xmllink': 'http://www.digitalhumanities.org/dhq/vol/16/3/000629.xml'}
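To make it clearer what normalize-space() does to the title, and why the example still needs a separate step for "\xa0", here is a minimal pure-Python sketch of the same behaviour. The helper name normalize_space and the sample strings are made up for illustration; the only assumption about XPath is that normalize-space() treats just space, tab, carriage return and line feed as whitespace, as XPath 1.0 defines it.

```python
import re

# XPath 1.0 whitespace characters: space, tab, carriage return, line feed.
XML_WHITESPACE = ' \t\r\n'


def normalize_space(text):
    """Approximate XPath normalize-space() for a plain Python string.

    Hypothetical helper for illustration; not part of Scrapy or parsel.
    """
    # Collapse every run of XML whitespace into one ordinary space ...
    collapsed = re.sub('[ \t\r\n]+', ' ', text)
    # ... then drop whatever whitespace is left at either end.
    return collapsed.strip(XML_WHITESPACE)


raw_title = '\n\t  Ethical and   Effective\n Visualization \t'
print(repr(normalize_space(raw_title)))

# A non-breaking space is not XML whitespace, so it is left untouched.
print(repr(normalize_space('2022\xa0')))
```

Because the non-breaking space survives this kind of normalization, the example above passes 'year' and 'volume' through unidecode as an extra step.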

CodePudding user response:

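Web pages often contain a special kind of space, the non-breaking space, written in HTML as &nbsp; (or the numeric entity &#160;). After parsing it shows up in your strings as the character '\xa0', which a replace() on an ordinary ' ' does not match. Once you know the pattern, you can replace it with a standard Python call such as .replace('\xa0', ' ').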
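As a rough sketch of that idea, the snippet below swaps non-breaking spaces for ordinary ones and then collapses the remaining whitespace with a regular expression. It also shows why .replace('\s ', ' ') in the question has no effect: str.replace() matches literal text, so '\s' is a backslash followed by the letter s, not a whitespace class. The function name clean_text and the sample string are invented for the illustration.

```python
import re


def clean_text(text):
    """Hypothetical helper: tidy whitespace in a scraped string."""
    # Turn non-breaking spaces (U+00A0) into ordinary spaces.
    text = text.replace('\xa0', ' ')
    # re.sub() understands \s; str.replace() does not.
    return re.sub(r'\s+', ' ', text).strip()


sample = '\n\t Ethical\xa0and   Effective \n'

# str.replace() searches for the three literal characters
# backslash, 's', space, so nothing in the sample matches.
unchanged = sample.replace(r'\s ', ' ')
print(unchanged == sample)

print(repr(clean_text(sample)))
```

In the spider this could be applied to each extracted value, for example clean_text(response.css('h1.articleTitle::text').get(default='')). Unlike unidecode, it leaves other non-ASCII characters, such as accented letters in author names, as they are.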
