Python FTP return corrupted name of file: じし->し ゙し (how can i properly decode/encode a string)

Time:05-07

In my open-source application I use rented hosting with FTP access. The application needs to read the list of files in a directory and parse it, but some of the file names come back corrupted. How can I recover the correct names, or ask the FTP server to return them correctly?

import ftplib

ftp_domain = "japcards.ru"
ftp_login = "u1670424_jap_db"
ftp_pass = "Jap2DbPass"

if __name__ == '__main__':
    ftp = ftplib.FTP(ftp_domain)
    ftp.encoding = 'utf-8'  # decode server replies as UTF-8
    ftp.login(ftp_login, ftp_pass)
    ftp.cwd("audio/jp")
    ftpList = ftp.nlst()
    ftpList.sort()
    for i in ftpList:
        print(i)                  # name as rendered by the terminal
        print(i.encode('utf-8'))  # raw UTF-8 bytes of the same name

An excerpt from the output of the example:

し ゙しょけい.wav
b'\xe3\x81\x97\xe3\x82\x99\xe3\x81\x97\xe3\x82\x87\xe3\x81\x91\xe3\x81\x84.wav'

CodePudding user response:

This appears to just be an issue with how your OS / STDOUT is handling the characters.

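As your encoded output shows, the character じ arrives as the bytes \xE3\x81\x97\xE3\x82\x99, that is し (U+3057) followed by the combining voiced sound mark (U+3099), rather than the single precomposed code point じ (U+3058, \xE3\x81\x98). When you inspect the variable in memory, that pair is displayed as a single character.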
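
To make this visible, here is a minimal sketch that decodes the bytes from the question and lists the code points behind the first displayed character. The helper name describe is only illustrative, and the byte string is copied from the output above.

```python
import unicodedata

# Bytes copied from the question's output.
raw = b'\xe3\x81\x97\xe3\x82\x99\xe3\x81\x97\xe3\x82\x87\xe3\x81\x91\xe3\x81\x84.wav'
name = raw.decode('utf-8')


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The first visible character is really two code points:
# a base letter followed by a combining mark.
for code, label in describe(name[:2]):
    print(code, label)

# A non-zero combining class marks the second code point as a combining character.
print(unicodedata.combining(name[1]) != 0)
```

If the second code point reports a non-zero combining class, the name is stored in decomposed form, and the odd rendering comes from how the terminal draws that pair.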

If you paste the "desired" text into your IDE (PyCharm on Windows in my case), \xE3\x81\x97\xE3\x82\x99 is replaced with \xE3\x81\x97\x20\xE3\x82\x99, i.e. a space (\x20) is inserted before the combining mark.

Apart from how the names are printed on screen, this shouldn't affect the rest of your code, since the file names are preserved intact in memory.

CodePudding user response:

Correct answer: you need to run the script with Python 3.10.4 or later. To compare strings, replace each base kana + combining mark pair with its precomposed character (Perl snippet duplicated from a website):

$str = "がぎぐげござじずぜぞだぢづでど・・・";
 
# Hiragana
$str =~ s/\xE3\x81\x8B\xE3\x82\x99/\xE3\x81\x8C/g;  # か ○゛=> が
$str =~ s/\xE3\x81\x8D\xE3\x82\x99/\xE3\x81\x8E/g;  # き ○゛=> ぎ
$str =~ s/\xE3\x81\x8F\xE3\x82\x99/\xE3\x81\x90/g;  # く ○゛=> ぐ
$str =~ s/\xE3\x81\x91\xE3\x82\x99/\xE3\x81\x92/g;  # け ○゛=> げ
$str =~ s/\xE3\x81\x93\xE3\x82\x99/\xE3\x81\x94/g;  # こ ○゛=> ご
$str =~ s/\xE3\x81\x95\xE3\x82\x99/\xE3\x81\x96/g;  # さ ○゛=> ざ
$str =~ s/\xE3\x81\x97\xE3\x82\x99/\xE3\x81\x98/g;  # し ○゛=> じ
$str =~ s/\xE3\x81\x99\xE3\x82\x99/\xE3\x81\x9A/g;  # す ○゛=> ず
$str =~ s/\xE3\x81\x9B\xE3\x82\x99/\xE3\x81\x9C/g;  # せ ○゛=> ぜ
$str =~ s/\xE3\x81\x9D\xE3\x82\x99/\xE3\x81\x9E/g;  # そ ○゛=> ぞ
$str =~ s/\xE3\x81\x9F\xE3\x82\x99/\xE3\x81\xA0/g;  # た ○゛=> だ
$str =~ s/\xE3\x81\xA1\xE3\x82\x99/\xE3\x81\xA2/g;  # ち ○゛=> ぢ
$str =~ s/\xE3\x81\xA4\xE3\x82\x99/\xE3\x81\xA5/g;  # つ ○゛=> づ
$str =~ s/\xE3\x81\xA6\xE3\x82\x99/\xE3\x81\xA7/g;  # て ○゛=> で
$str =~ s/\xE3\x81\xA8\xE3\x82\x99/\xE3\x81\xA9/g;  # と ○゛=> ど
$str =~ s/\xE3\x81\xAF\xE3\x82\x99/\xE3\x81\xB0/g;  # は ○゛=> ば
$str =~ s/\xE3\x81\xAF\xE3\x82\x9A/\xE3\x81\xB1/g;  # は ○゜=> ぱ
$str =~ s/\xE3\x81\xB2\xE3\x82\x99/\xE3\x81\xB3/g;  # ひ ○゛=> び
$str =~ s/\xE3\x81\xB2\xE3\x82\x9A/\xE3\x81\xB4/g;  # ひ ○゜=> ぴ
$str =~ s/\xE3\x81\xB5\xE3\x82\x99/\xE3\x81\xB6/g;  # ふ ○゛=> ぶ
$str =~ s/\xE3\x81\xB5\xE3\x82\x9A/\xE3\x81\xB7/g;  # ふ ○゜=> ぷ
$str =~ s/\xE3\x81\xB8\xE3\x82\x99/\xE3\x81\xB9/g;  # へ ○゛=> べ
$str =~ s/\xE3\x81\xB8\xE3\x82\x9A/\xE3\x81\xBA/g;  # へ ○゜=> ぺ
$str =~ s/\xE3\x81\xBB\xE3\x82\x99/\xE3\x81\xBC/g;  # ほ ○゛=> ぼ
$str =~ s/\xE3\x81\xBB\xE3\x82\x9A/\xE3\x81\xBD/g;  # ほ ○゜=> ぽ
 
# Katakana
$str =~ s/\xE3\x82\xAB\xE3\x82\x99/\xE3\x82\xAC/g;  # カ ○゛=> ガ
$str =~ s/\xE3\x82\xAD\xE3\x82\x99/\xE3\x82\xAE/g;  # キ ○゛=> ギ
$str =~ s/\xE3\x82\xAF\xE3\x82\x99/\xE3\x82\xB0/g;  # ク ○゛=> グ
$str =~ s/\xE3\x82\xB1\xE3\x82\x99/\xE3\x82\xB2/g;  # ケ ○゛=> ゲ
$str =~ s/\xE3\x82\xB3\xE3\x82\x99/\xE3\x82\xB4/g;  # コ ○゛=> ゴ
$str =~ s/\xE3\x82\xB5\xE3\x82\x99/\xE3\x82\xB6/g;  # サ ○゛=> ザ
$str =~ s/\xE3\x82\xB7\xE3\x82\x99/\xE3\x82\xB8/g;  # シ ○゛=> ジ
$str =~ s/\xE3\x82\xB9\xE3\x82\x99/\xE3\x82\xBA/g;  # ス ○゛=> ズ
$str =~ s/\xE3\x82\xBB\xE3\x82\x99/\xE3\x82\xBC/g;  # セ ○゛=> ゼ
$str =~ s/\xE3\x82\xBD\xE3\x82\x99/\xE3\x82\xBE/g;  # ソ ○゛=> ゾ
$str =~ s/\xE3\x82\xBF\xE3\x82\x99/\xE3\x83\x80/g;  # タ ○゛=> ダ
$str =~ s/\xE3\x83\x81\xE3\x82\x99/\xE3\x83\x82/g;  # チ ○゛=> ヂ
$str =~ s/\xE3\x83\x84\xE3\x82\x99/\xE3\x83\x85/g;  # ツ ○゛=> ヅ
$str =~ s/\xE3\x83\x86\xE3\x82\x99/\xE3\x83\x87/g;  # テ ○゛=> デ
$str =~ s/\xE3\x83\x88\xE3\x82\x99/\xE3\x83\x89/g;  # ト ○゛=> ド
$str =~ s/\xE3\x83\x8F\xE3\x82\x99/\xE3\x83\x90/g;  # ハ ○゛=> バ
$str =~ s/\xE3\x83\x8F\xE3\x82\x9A/\xE3\x83\x91/g;  # ハ ○゜=> パ
$str =~ s/\xE3\x83\x92\xE3\x82\x99/\xE3\x83\x93/g;  # ヒ ○゛=> ビ
$str =~ s/\xE3\x83\x92\xE3\x82\x9A/\xE3\x83\x94/g;  # ヒ ○゜=> ピ
$str =~ s/\xE3\x83\x95\xE3\x82\x99/\xE3\x83\x96/g;  # フ ○゛=> ブ
$str =~ s/\xE3\x83\x95\xE3\x82\x9A/\xE3\x83\x97/g;  # フ ○゜=> プ
$str =~ s/\xE3\x83\x98\xE3\x82\x99/\xE3\x83\x99/g;  # ヘ ○゛=> ベ
$str =~ s/\xE3\x83\x98\xE3\x82\x9A/\xE3\x83\x9A/g;  # ヘ ○゜=> ペ
$str =~ s/\xE3\x83\x9B\xE3\x82\x99/\xE3\x83\x9C/g;  # ホ ○゛=> ボ
$str =~ s/\xE3\x83\x9B\xE3\x82\x9A/\xE3\x83\x9D/g;  # ホ ○゜=> ポ
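
In Python, the same substitutions can probably be achieved without a hand-written table, because each pair above is the canonical decomposition of its precomposed kana. The sketch below assumes that Unicode NFC normalization via the standard unicodedata module is acceptable for your file names. The function name normalize_names is hypothetical.

```python
import unicodedata


def normalize_names(names):
    """Return the names in NFC form (base kana + combining mark -> one code point), sorted."""
    return sorted(unicodedata.normalize("NFC", n) for n in names)


# Decomposed name as returned by the FTP server in the question:
# U+3057 (し) followed by U+3099 (combining voiced sound mark).
decomposed = "\u3057\u3099\u3057\u3087\u3051\u3044.wav"
composed = normalize_names([decomposed])[0]

# After normalization the name compares equal to the precomposed spelling.
print(composed == "\u3058\u3057\u3087\u3051\u3044.wav")
print(len(decomposed), len(composed))
```

Applied to the question's code, this would mean passing the result of ftp.nlst() through normalize_names before comparing or displaying the names. If a name has to be sent back to the server (for example to download the file), the original, unnormalized string is likely the one to use, since that is how the server stores it.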