unknown characters "سقوط" are scraped instead of encoding utf-8

Time:12-05

I'm trying to scrape a non-English website (https://arzdigital.com/). The problem is that although I import "urllib.parse" at the top of the spider and set the following in settings.py:

FEED_EXPORT_ENCODING='utf-8'

the spider doesn't encode the text properly (the output looks like this: "سقوط ۱۰ هزار دلاری بیت کوین در عرض یک ساعت؛ علت چه بود؟"). Even calling .encode() didn't help.

So, here is my spider code:

# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse
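# Percent-encode a URL that contains non-ASCII characters.
# Note: encoded_url is never used by the spider below, so this has no
# effect on how the scraped responses are decoded.
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
# 'https://fa.wikipedia.org/wiki/صفحهٔ_اصلی'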


class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']


    # One listing page per number; the f-string already fills in {i}.
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        # Each post link carries the article title in its "title" attribute.
        posts = response.xpath("//a[@class='arz-last-post arz-row']")

        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                yield {
                    'post_title': post_title
                }
        except AttributeError:
            logging.error("The element didn't exist")

Can anybody tell me where the problem is? Thank you so much!

CodePudding user response:

The response headers normally declare a charset; if they don't, assume Windows-1252. If the declared charset is ISO-8859-1, treat it as Windows-1252 instead.

With that, you have the right encoding for reading the response.
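
The charset rule above could be sketched roughly like this. It's a minimal illustration using only the standard library, not what Scrapy does internally; the helper name pick_encoding and the sample header values are assumptions made up for the example.

```python
from email.message import Message


def pick_encoding(content_type: str, default: str = "windows-1252") -> str:
    """Choose a decoding charset from a Content-Type header value.

    Hypothetical helper: falls back to Windows-1252 when no charset is
    declared, and substitutes Windows-1252 for ISO-8859-1.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset, or None
    if charset is None:
        return default
    if charset in ("iso-8859-1", "latin-1", "latin1"):
        return "windows-1252"
    return charset


# Sample header values (assumed, not taken from the real site).
body = "سقوط".encode("utf-8")
encoding = pick_encoding("text/html; charset=UTF-8")
text = body.decode(encoding)
```

Inside a Scrapy callback the header would come from response.headers, and response.encoding shows which encoding Scrapy itself settled on, which is worth logging before changing anything else.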

It's best to store everything as Unicode in UTF-8, so that every script can be represented.

You may be looking at the output in a console (on Windows, most likely not UTF-8), in which case each multi-byte sequence shows up as two odd characters. Store the output in a file and open it with Notepad or a similar editor, where you can see the encoding and change it. Nowadays even Windows Notepad sometimes recognizes UTF-8.
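
To see the "two weird chars" effect for yourself, here is a small sketch that decodes UTF-8 bytes with the wrong code page and then writes the text to a UTF-8 file instead of printing it. The file name titles_utf8.txt is hypothetical, and the temp directory is used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

title = "سقوط"

# Each of these four Persian letters takes two bytes in UTF-8.
raw = title.encode("utf-8")

# Reading those bytes as Windows-1252 turns every letter into two
# unrelated characters; this is the garbage a non-UTF-8 console shows.
garbled = raw.decode("windows-1252")

# The bytes themselves are intact, so the damage is reversible.
recovered = garbled.encode("windows-1252").decode("utf-8")

# Write to a file with an explicit encoding rather than relying on
# the console or the platform default.
out_path = Path(tempfile.gettempdir()) / "titles_utf8.txt"  # hypothetical name
out_path.write_text(title + "\n", encoding="utf-8")
restored = out_path.read_text(encoding="utf-8").strip()
```

If the file looks right in an editor that is set to UTF-8, the scraped data is fine and only the console display is off.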
