Unknown characters "Ø³Ù‚ÙˆØ·" are scraped instead of UTF-8 text

I'm trying to scrape a non-English website (https://arzdigital.com/). The problem is that, although I import "urllib.parse" at the top of the spider and wrote the following in the settings.py file,

FEED_EXPORT_ENCODING='utf-8'

the spider doesn't produce correctly encoded text. The output looks like this: "Ø³Ù‚ÙˆØ· Û±Û° Ù‡Ø²Ø§Ø± Ø¯Ù„Ø§Ø±ÛŒ Ø¨ÛŒØª Ú©ÙˆÛŒÙ† Ø¯Ø± Ø¹Ø±Ø¶ ÛŒÚ© Ø³Ø§Ø¹ØªØ› Ø¹Ù„Øª Ú†Ù‡ Ø¨ÙˆØ¯ØŸ". Even calling the .encode() function didn't help.

So, here is my spider code:

# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/صفحهٔ_اصلی'


class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']


    # Build the paginated listing URLs (pages 1 to 352)
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts=response.xpath("//a[@class='arz-last-post arz-row']")
        
        try:

            for post in posts:
                post_title=post.xpath(".//@title").get()
                yield{
                    'post_title':post_title
                }
        except AttributeError:
            logging.error("The element didn't exist")

Can anybody tell me where the problem is? Thank you so much!

CodePudding user response：

The response headers normally declare a charset; if they don't, assume Windows-1252. If the declared charset is ISO-8859-1, substitute Windows-1252 for it as well, which is what browsers do.

Now you have the right encoding to decode the page with.
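A minimal sketch of that rule, using only the Python standard library. The function name pick_encoding and the sample header values are made up for illustration; in a Scrapy spider the header would come from response.headers, and Scrapy performs its own detection anyway (exposed as response.encoding), so this only spells out the logic.

```python
from email.message import Message


def pick_encoding(content_type, default="windows-1252"):
    """Return the encoding to decode a page with, given its Content-Type header.

    Hypothetical helper: falls back to Windows-1252 when no charset is
    declared, and maps ISO-8859-1 to Windows-1252 as well.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset, or None if absent
    if charset is None:
        return default
    if charset in ("iso-8859-1", "latin-1", "latin1"):
        return "windows-1252"
    return charset


print(pick_encoding("text/html; charset=UTF-8"))
print(pick_encoding("text/html; charset=ISO-8859-1"))
print(pick_encoding("text/html"))
```

If the result is not what the site really uses, decoding the raw bytes yourself with the correct encoding is the usual next step.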

It is best to store everything as full Unicode in UTF-8, so that every script can be represented.
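As a rough illustration of storing text as UTF-8, the sketch below writes one item to a JSON file and reads it back. The file name posts.json and the sample item are assumptions for the example; in Scrapy the FEED_EXPORT_ENCODING setting from the question is what controls this for feed exports.

```python
import json
import os
import tempfile

# Sample item shaped like the spider's output (the title is an assumed example)
items = [{"post_title": "سقوط"}]

path = os.path.join(tempfile.mkdtemp(), "posts.json")

# Write real UTF-8 text rather than \uXXXX escapes
with open(path, "w", encoding="utf-8") as fh:
    json.dump(items, fh, ensure_ascii=False)

# The raw bytes on disk are the UTF-8 encoding of the Persian text
with open(path, "rb") as fh:
    raw = fh.read()

# Reading back with the same encoding restores the original strings
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(loaded == items)
```

Passing encoding="utf-8" explicitly on both the write and the read matters because the default encoding of open() depends on the platform.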

It may be that you are viewing the output in a console (on Windows, most likely not UTF-8), in which case each multi-byte UTF-8 sequence shows up as two odd characters. Store the output in a file and open it with Notepad or a similar editor, where you can see the encoding and change it. Nowadays even Windows Notepad sometimes recognizes UTF-8.
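The effect can be reproduced in a few lines. This sketch assumes the garbled text came from UTF-8 bytes being read as Windows-1252, which matches the "Ø³Ù‚ÙˆØ·" sample in the question; repair_mojibake is a hypothetical helper name, not part of any library.

```python
# The Persian word that appears garbled in the question's title
original = "سقوط"

# Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252
garbled = original.encode("utf-8").decode("cp1252")


def repair_mojibake(text):
    """Undo a UTF-8-read-as-Windows-1252 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = repair_mojibake(garbled)

# Each Persian letter is two bytes in UTF-8, so it shows up as two characters
print(len(original), len(garbled))
print(repaired == original)
```

A round trip like this only works when every byte survived the wrong decoding, so it is better treated as a diagnostic than as a fix; reading the data with the right encoding in the first place is the real solution.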