Encoding a string to UTF-8 leaves non-English characters as byte strings

Time:12-31

I am using snscrape to scrape Twitter; it stores each tweet's content as a string. I am trying to save this to a text file, but the non-English characters are not written correctly.

import snscrape.modules.twitter as sntwitter
# maxTweets was not defined in the original snippet; 100 is an assumed limit
maxTweets = 100
# Creating list to append tweet data to
tweets_list1 = []
# Using TwitterSearchScraper to scrape data
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:eenadulivenews').get_items()):
    if i > maxTweets:
        break
    print(tweet.content)

Here tweet.content is a string. I am trying to save the output to a file from the command line, like this:

python main.py > output.txt

This gives me an error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 5-10: character maps to <undefined>

So I tried encoding the text as UTF-8, since the tweets are in a language that UTF-8 can represent.

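import snscrape.modules.twitter as sntwitter
# maxTweets was not defined in the original snippet; 100 is an assumed limit
maxTweets = 100
# Creating list to append tweet data to
tweets_list1 = []
# Using TwitterSearchScraper to scrape data
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:eenadulivenews').get_items()):
    if i > maxTweets:
        break
    # encode() returns bytes, so print() shows the escaped bytes literal
    print(tweet.content.encode('utf-8'))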

This runs, but the non-English characters come out as escaped bytes, something like this:

b'\xe0\xb0\x86 \xe0\xb0\xb5\xe0\xb0\x82\xe0\xb0\x9f\xe0\xb0\xb2\xe0\xb0\x95\xe0\xb1\x81 242 \xe0\xb0\x95\xe0\xb1\x8b\xe0\xb0\x9f\xe0\xb1\x8d\xe0\xb0\xb2\xe0\xb0\xae\xe0\xb0\x82\xe0\xb0\xa6\xe0\xb0\xbf \xe0\xb0\xb5\xe0\xb1\x80\xe0\xb0\x95\xe0\xb1\x8d\xe0\xb0\xb7\xe0\xb0\x95\xe0\xb1\x81\xe0\xb0\xb2\xe0\xb1\x81\n\xe0\xb0\xaf\xe0\xb1\x82\xe0\xb0\x9f\xe0\xb1\x8d\xe0\xb0\xaf\xe0\xb1\x82\xe0\xb0\xac\xe0\xb0\xb0\xe0\xb1\x8d\xe2\x80\x8c... \xe0\xb0\x88 \xe0\xb0\x98\xe0\xb0\xa8\xe0\xb0\xa4\xe0\xb0\xb2\xe0\xb0\xa8\xe0\xb1\x8d\xe0\xb0\xa8\xe0\xb1\x80 62 \xe0\xb0\x8f\xe0\xb0\xb3\xe0\xb1\x8d\xe0\xb0\xb2 \xe0\xb0\xa8\xe0\xb0\xbf\xe0\xb0\xb7\xe0\xb0\xbe \xe0\xb0\xae\xe0\xb0\xa7\xe0\xb1\x81\xe0\xb0\xb2\xe0\xb0\xbf\xe0\xb0\x95 \xe0\xb0\xb8\xe0\xb0\xbe\xe0\xb0\xa7\xe0\xb0\xbf\xe0\xb0\x82\xe0\xb0\x9a\xe0\xb0\xbf\xe0\xb0\xa8\xe0\xb0\xb5\xe0\xb1\x87.

English characters are displayed correctly.

The output is the same in cmd and in the text file when I open it in Notepad with the encoding set to UTF-8.

How do I get the non-English characters to come out correctly? I am on Windows 11.

CodePudding user response:

I suspect the confusion is about what UTF-8 actually is. Characters that are not "English" (I'm presuming you mean essentially ASCII) are still encoded in UTF-8 as 8-bit units, but to fit all of Unicode into 8-bit units, everything outside ASCII is pushed into a longer representation of two to four bytes per character. Because Unicode covers most of the world's writing systems, encoding a string as UTF-8 means you end up with a lot of bytes that don't look "right" when shown one at a time, even though they are valid UTF-8. I'd suggest reading (of all things) the Wikipedia article on UTF-8 as a start.
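To make the "longer representations" point concrete, here is a small sketch (not from the original post) comparing an ASCII letter with the first Telugu letter in the output above, U+0C06. The helper name utf8_bytes is made up for illustration.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a bytes object."""
    return ch.encode("utf-8")

ascii_a = utf8_bytes("A")         # one byte, same value as in ASCII
telugu_aa = utf8_bytes("\u0c06")  # Telugu letter AA needs three bytes

print(len(ascii_a), len(telugu_aa))           # 1 3
print(telugu_aa)                              # b'\xe0\xb0\x86'
# Decoding the same bytes gives the original character back.
print(telugu_aa.decode("utf-8") == "\u0c06")  # True
```

So the \xe0\xb0\x86 at the start of the question's output is not corruption; it is that one letter shown byte by byte.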

Maybe I'm wrong (it happens), but the problem is probably that you are asking for a simple conversion of text that has no English counterpart.

CodePudding user response:

print(tweet.content.encode('utf-8')) prints a bytes object (data, not text) in a human-readable, ASCII-compatible form: a leading b marks it as bytes, and byte values above 127 are shown as hexadecimal escape codes \xNN. That is not what you want.
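A short sketch of the difference, using a made-up sample string rather than a real tweet. print() on a bytes object writes the same text that repr() returns, so the escapes are literally ASCII characters in the output:

```python
text = "\u0c06 242"          # sample text: one Telugu letter plus ASCII
data = text.encode("utf-8")  # bytes, not str

shown = repr(data)           # the text that print(data) would write
print(shown)                 # b'\xe0\xb0\x86 242'

# The escaped form is plain ASCII: a backslash, an 'x', two hex digits.
print(shown.isascii())       # True

# The text is not lost; decoding the bytes recovers it.
print(data.decode("utf-8") == text)  # True
```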

If you are using output redirection, you can tell Python which encoding to use when it converts text to the byte stream written to the file, by setting an environment variable:

set PYTHONIOENCODING=utf8
python main.py > output.txt
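As a rough illustration of what that setting changes, the sketch below pushes the same text through an in-memory text stream with two different encodings. The helper names write_through and use_utf8_stdout are made up, and cp1252 is only an assumption about the code page a redirected stdout might get on Windows; the actual one depends on the machine's locale. The second function is an in-script alternative to the environment variable and relies on sys.stdout.reconfigure, which is available from Python 3.7.

```python
import io
import sys

def write_through(text, encoding):
    """Write text through a text stream using `encoding`; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(text)  # raises UnicodeEncodeError if a character cannot be encoded
    stream.flush()
    stream.detach()     # release raw so it stays open for getvalue()
    return raw.getvalue()

def use_utf8_stdout():
    """Switch sys.stdout to UTF-8 if the stream supports it; return True on success."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
        return True
    return False

sample = "\u0c06 242"

print(write_through(sample, "utf-8"))  # b'\xe0\xb0\x86 242'

try:
    write_through(sample, "cp1252")    # assumed legacy code page
except UnicodeEncodeError as exc:
    print("cp1252 cannot encode it:", exc.reason)
```

Calling use_utf8_stdout() once at the top of main.py should have a similar effect to setting PYTHONIOENCODING, without depending on the shell environment.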

You can also write the data directly to a file specifying the encoding instead of using redirection:

with open('tweet.txt','w',encoding='utf8') as f:
    f.write(tweet.content)
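In the question's loop you would open the file once and write every tweet inside it. A minimal sketch of that, assuming the tweet texts are already collected in a list; save_lines and the sample strings are made up, and a temporary directory is used so that the example does not touch real files:

```python
import os
import tempfile

def save_lines(lines, path):
    """Write each string on its own line, encoded as UTF-8."""
    # newline="\n" keeps line endings the same on every platform
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")

def load_text(path):
    """Read the file back as text, decoding it as UTF-8."""
    with open(path, encoding="utf-8", newline="") as f:
        return f.read()

samples = ["\u0c06 242", "plain ASCII"]  # stand-ins for tweet.content values

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "tweets.txt")
    save_lines(samples, out_path)
    round_trip = load_text(out_path)
    with open(out_path, "rb") as f:
        raw_bytes = f.read()

print(round_trip == "\u0c06 242\nplain ASCII\n")  # True
```

Because the encoding is fixed in the open() call, the result does not depend on the console code page or on how the script is launched.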