Cleaning a sentence of numbers, signs and other languages

I have a txt file that contains Japanese sentences. I would like to remove everything that is not Japanese: numbers, English letters, words in any other language, signs, and symbols. Is there a quick way to do it? Thanks

Hi !こんにちは、私の給料は月額10000ドルです。 XO XO
私はあなたの料理が大好きです
私のフライトはAPX1999です。
私はサッカーの試合を見るのが大好きです。

Words to remove: Hi ! XO XO 10000 APX1999

CodePudding user response：

Python 3.7 added the str.isascii() method. The code below removes every ASCII character except newlines, which is not exactly what was asked, but it may suggest a strategy.

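# Assumes the file is UTF-8 encoded; adjust the encoding if it is not.
with open('japanese.txt', encoding='utf-8') as infile:
    # Keep line breaks and every non-ASCII character.
    print(''.join(c for c in infile.read() if c == '\n' or not c.isascii()))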
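Since dropping ASCII alone would still leave text from other non-ASCII scripts (Cyrillic, accented Latin, and so on), a whitelist may be closer to what the question asks. The sketch below is only one possible approach: it assumes "Japanese" means hiragana, katakana, kanji and CJK punctuation, and the Unicode ranges and the name keep_japanese are illustrative choices, not something given in the question.

```python
import re

# Assumed definition of "Japanese" (an illustrative choice of Unicode blocks).
NON_JAPANESE = re.compile(
    r"[^\u3000-\u303F"   # CJK symbols and punctuation, e.g. 、 and 。
    r"\u3040-\u309F"     # hiragana
    r"\u30A0-\u30FF"     # katakana
    r"\u4E00-\u9FFF"     # CJK unified ideographs (kanji)
    r"\n]+"              # line breaks are kept as well
)

def keep_japanese(text: str) -> str:
    """Remove every run of characters outside the whitelisted ranges."""
    return NON_JAPANESE.sub("", text)

print(keep_japanese("Hi !こんにちは、私の給料は月額10000ドルです。 XO XO"))
```

Note that the kanji range is shared with Chinese, so Chinese text would survive this filter; telling the two languages apart would need more than a character-range check.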

CodePudding user response：

The simplest way is this:

s = "Hi !こんにちは、私の給料は月額10000ドルです。 XO XO 私はあなたの料理が大好きです私のフライトはAPX1999です。私はサッカーの試合を見るのが大好きです"

no_ascii = ''
for c in s:
    code_point = ord(c)
    # Keep characters above the 7-bit ASCII range (and NUL, code 0).
    if code_point > 127 or code_point == 0:
        no_ascii += c

print(no_ascii)

Output

こんにちは、私の給料は月額ドルです。私はあなたの料理が大好きです私のフライトはです。私はサッカーの試合を見るのが大好きです

CodePudding user response：

import re
import string
s = '''Hi !こんにちは、私の給料は月額10000ドルです。 XO XO
私はあなたの料理が大好きです
私のフライトはAPX1999です。
私はサッカーの試合を見るのが大好きです。
'''
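# Remove every character in string.printable: ASCII digits, letters,
# punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) and whitespace, newlines included.
# re.escape keeps characters such as ] and \ from being misread inside the character class.
replaced = re.sub(f'[{re.escape(string.printable)}]', '', s)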
print(replaced)

Output

こんにちは、私の給料は月額ドルです。私はあなたの料理が大好きです私のフライトはです。私はサッカーの試合を見るのが大好きです。
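The output is a single line because string.printable also contains whitespace, so the newlines are removed along with everything else. If the line structure of the file should survive, one option is to take the newline out of the set before building the pattern. This is a minimal sketch; the name strip_ascii_keep_lines is hypothetical.

```python
import re
import string

# Printable ASCII minus "\n", escaped so it is safe inside a character class.
ascii_except_newline = re.escape(string.printable.replace("\n", ""))
pattern = re.compile(f"[{ascii_except_newline}]")

def strip_ascii_keep_lines(text: str) -> str:
    """Remove printable ASCII characters but leave line breaks in place."""
    return pattern.sub("", text)

s = '''Hi !こんにちは、私の給料は月額10000ドルです。 XO XO
私はあなたの料理が大好きです
私のフライトはAPX1999です。
私はサッカーの試合を見るのが大好きです。
'''
print(strip_ascii_keep_lines(s))
```

Each sentence then stays on its own line, which may be easier to work with if the file is processed sentence by sentence afterwards.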