I'm scraping to a .csv file and getting many extra spaces in the output that are not on the actual web page. I'm able to remove tabs and line breaks using .replace(), but the spaces don't get removed the same way. Even if there were something unusual in the formatting of the web page, it should still be removed by the .replace() calls. What am I missing?
import re
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DhqSpider(CrawlSpider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
        Rule(LinkExtractor(allow='index.html')),
        Rule(LinkExtractor(allow='vol'), callback='parse_article'),
    )

    def parse_article(self, response):
        yield {
            'title': response.css('h1.articleTitle::text').get().replace('\n', '').replace('\t', '').replace('\s ', ' '),
            'author1': response.css('div.author a::text').getall(),
            'year': response.css('div#pubInfo::text')[0].get(),
            'volume': response.css('div#pubInfo::text')[1].get(),
            'xmllink': response.urljoin(response.xpath('(//div[@]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
A piece of the CSV: https://pastebin.com/7GvZT3b9
One of the pages included in the .csv: http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html
CodePudding user response:
You can use XPath's normalize-space(), which strips leading and trailing whitespace and collapses internal runs of whitespace into single spaces. Either replace the CSS selectors with XPath selectors entirely, or remove the ::text from the CSS selector and chain an xpath() call with normalize-space() after it, as shown in the example.
Example:
import scrapy
import unidecode  # to remove "\xa0" (non-breaking spaces) from the strings


class DhqSpider(scrapy.Spider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html']

    def parse(self, response):
        item = {
            'title': response.css('h1.articleTitle').xpath('normalize-space(text())').get(default='').strip(),
            'author1': response.css('div.author a').xpath('normalize-space(text())').getall(),
            'year': unidecode.unidecode(response.css('div#pubInfo::text')[0].get()),
            'volume': unidecode.unidecode(response.css('div#pubInfo::text')[1].get()),
            'xmllink': response.urljoin(response.xpath('(//div[@]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
        item['author1'] = [unidecode.unidecode(i) for i in item['author1']]
        yield item
Output:
{'title': 'Ethical and Effective Visualization of Knowledge Networks', 'author1': ['Chelsea Canon', 'canon_at_nevada_dot_unr_dot_edu', ' https://orcid.org/0000-0002-0431-343X', 'Douglas Boyle', 'douglasb_at_unr_dot_edu', ' https://orcid.org/0000-0002-3301-3997', 'K. J. Hepworth', 'katherine_dot_hepworth_at_unisa_dot_edu_dot_au', ' https://orcid.org/0000-0003-1059-567X'], 'year': '2022', 'volume': 'Volume 16 Number 3', 'xmllink': 'http://www.digitalhumanities.org/dhq/vol/16/3/000629.xml'}
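Since the goal is a .csv file, the spider can then be run with Scrapy's built-in feed export (the output filename here is just an example):

scrapy crawl dhq1 -o articles.csv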
CodePudding user response:
Web pages often have a special type of space, called a non-breaking space, represented in HTML via the &nbsp; or &#160; entities. In Python strings it shows up as the character '\xa0'.
Once you know which character you are dealing with, you can remove it with a standard Python replace() call.
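For example, a minimal sketch (the sample string is made up) of removing non-breaking spaces and then collapsing remaining whitespace runs. Note that str.replace() treats its arguments as literal strings, so '\s' has no special meaning there; whitespace patterns need re.sub:

import re

scraped = 'Ethical\xa0and  Effective\tVisualization'  # hypothetical scraped text

# str.replace() takes literal strings, so the '\xa0' character must be spelled out
no_nbsp = scraped.replace('\xa0', ' ')

# '\s' is a regex class, so whitespace runs are collapsed with re.sub, not str.replace
clean = re.sub(r'\s+', ' ', no_nbsp).strip()

print(clean)  # Ethical and Effective Visualization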