I'm trying to scrape a non-English website (https://arzdigital.com/). The problem is that, although I import "urllib.parse" at the top of the spider and set the following in the settings.py file,
FEED_EXPORT_ENCODING='utf-8'
the spider doesn't encode the text properly (the output is like this: "سقوط ۱۰ هزار دلاری بیت کوین در عرض یک ساعت؛ علت چه بود؟"). Even using the .encode() function didn't work.
So, here is my spider code:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/صفحهٔ_اصلی'
class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    # One listing page per page number, 1 through 352.
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        # Each post on the listing page is an <a> element carrying the title attribute.
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                yield {
                    'post_title': post_title
                }
        except AttributeError:
            logging.error("The element didn't exist")
Can anybody tell me where the problem is? Thank you so much!
CodePudding user response:
Check the response headers for a charset; if none is given, the encoding defaults to Windows-1252. If the charset you find is ISO-8859-1, substitute Windows-1252 for it. Now you have the right encoding to read the response. It is best to store everything in full Unicode (UTF-8), so that every script can be represented.
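The rule above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names pick_encoding and decode_body are hypothetical, the header value is assumed to be passed in as a plain string, and Scrapy normally performs its own encoding detection, so this is not how the spider itself has to do it.

```python
from email.message import Message

# Fallback used when the header carries no charset (as described above).
DEFAULT_ENCODING = "windows-1252"


def pick_encoding(content_type):
    """Return the codec name to use for a Content-Type header value.

    Hypothetical helper: content_type is the raw header string or None.
    """
    if not content_type:
        return DEFAULT_ENCODING
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased name, or None if absent
    if charset is None:
        return DEFAULT_ENCODING
    if charset == "iso-8859-1":
        # Substitute Windows-1252 for ISO-8859-1, as suggested above.
        return "windows-1252"
    return charset


def decode_body(body, content_type):
    """Decode raw response bytes into a str using the chosen encoding."""
    # errors="replace" keeps going on bytes the codec does not define.
    return body.decode(pick_encoding(content_type), errors="replace")
```

For instance, decode_body(raw_bytes, "text/html; charset=ISO-8859-1") would decode the bytes as Windows-1252, while a header declaring UTF-8 would be honored as is.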
You may be looking at the output in a console (on Windows, most likely not UTF-8), in which case each multi-byte sequence shows up as two odd characters. Store the output in a file and open it in Notepad or a similar editor, where you can see the encoding and change it. Nowadays even Windows Notepad sometimes recognizes UTF-8.
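To check the data independently of the console, one option is to write it to a file with an explicit UTF-8 encoding and read it back. The sketch below uses only the standard library; the function names and the file name posts.jsonl are made up for illustration, and the items stand in for whatever the spider yields. It also shows that JSON text containing \uXXXX escapes holds exactly the same string as the readable form, so escaped output is not corrupted data.

```python
import json
import os
import tempfile


def write_items_utf8(items, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for item in items:
            # ensure_ascii=False writes the characters themselves
            # instead of \uXXXX escape sequences.
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items_utf8(path):
    """Read the file back, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    items = [{"post_title": "\u0633\u0642\u0648\u0637"}]  # a short Persian word
    path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")  # hypothetical file name
    write_items_utf8(items, path)
    # The round trip preserves the text regardless of the console encoding.
    print(read_items_utf8(path) == items)
    # The escaped form decodes to the same string as the readable form.
    print(json.loads(json.dumps(items[0])) == items[0])
```

Opening the resulting file in an editor that reports the encoding shows whether the stored text is correct, even when the console display is not.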